COA (Week 1 To Week 12 Detailed Solution)

NPTEL Online Certification Course

Multi-Core Computer Architecture


Assignment Number - 1: Detailed Solution
Indian Institute of Technology Guwahati

1. In a typical execution cycle, which one of the following sequences accurately depicts the
steps involved?
a. Instruction fetch → Operand fetch → Instruction decode → Execute → Result store
b. Instruction decode → Instruction fetch → Operand fetch → Execute → Result store
c. Instruction fetch → Instruction decode → Operand fetch → Execute → Result store
d. Operand fetch → Instruction fetch → Instruction decode → Execute → Result store

In a processor's typical execution cycle, instructions follow a fetch-decode-execute pattern.
First, the processor fetches the next instruction from memory using the program counter.
The fetched instruction is then decoded to identify its operation and operands. Next, the
necessary operands are fetched from the register file or memory. With the operands in
hand, the instruction is executed, carrying out the intended operation. Finally, the result is
stored in the destination register or memory. This repeating cycle enables the processor to
execute instructions efficiently.
2. Which one of the following is the feature of the Little Endian scheme?
a. The least significant byte is stored at the smallest address
b. The most significant byte is stored at the smallest address
c. The least significant byte is stored at the largest address
d. The most significant byte is stored at any address

In Little Endian byte ordering, the lower-order bytes of a multi-byte data type (e.g.,
integers, floating-point numbers) are stored at lower memory addresses, and the
higher-order bytes are stored at higher memory addresses. This means that the least
significant byte is stored first in memory, at the smallest address, followed by the more
significant bytes in increasing order of significance.
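The two byte orders can be observed directly; a minimal Python sketch using the standard struct module (the value 0x11223344 is an arbitrary example):

```python
import struct

value = 0x11223344  # an arbitrary 4-byte integer

little = struct.pack("<I", value)  # "<" = little endian
big = struct.pack(">I", value)     # ">" = big endian

# Index 0 of the packed bytes is the smallest address: little endian
# puts the least significant byte (0x44) there, big endian the most
# significant byte (0x11).
print(little.hex())  # 44332211
print(big.hex())     # 11223344
```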

3. In terms of processor-memory interaction, what is the role of the Program Counter?
a. It contains the address of the instruction currently being executed
b. It contains the address of the next instruction to be fetched
c. It contains the data to be written into or read from memory
d. It contains the result of a computation

The Program Counter (PC) is a register that keeps track of the memory address of the next
instruction that the processor needs to fetch and execute. During the fetch phase of the
instruction cycle, the PC is used to access the memory location containing the next
instruction. Once the instruction is fetched, the PC is updated to point to the address of the
subsequent instruction, enabling the processor to continue executing instructions
sequentially.

4. CISC architecture attempts to minimize the number of instructions per program, but at the
cost of:
a. Using a larger memory
b. Using a wider bus to carry an instruction from memory
c. Decrease in the average number of cycles per instruction
d. Increase in the average number of cycles per instruction

In CISC (Complex Instruction Set Computer) architecture, complex multi-step
instructions are used to reduce the number of instructions needed to perform a particular
task. However, these complex instructions often require multiple cycles to execute, leading
to an increase in the average number of cycles per instruction. This can result in longer
execution times for individual instructions compared to simpler architectures such as RISC
(Reduced Instruction Set Computer).
5. A processor has 8 general-purpose registers. It uses a 24-bit instruction format. If the
opcode field occupies 6 bits, followed by the register field that stores a register address, how
many bits are left for other fields in the instruction?
a. 14 bits
b. 8 bits
c. 10 bits
d. 15 bits

Solution:
Total bits used for opcode field + register field = 6 + 3 (3 bits address 8 registers) = 9 bits.
Since the instruction format is 24 bits in total, 24 - 9 = 15 bits are left for other fields.
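The field arithmetic above can be sketched in a few lines of Python:

```python
import math

total_bits = 24
opcode_bits = 6
register_bits = math.ceil(math.log2(8))  # 8 registers need 3 address bits

remaining_bits = total_bits - (opcode_bits + register_bits)
print(remaining_bits)  # 15
```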
6. Consider the following operations done on a Stack Machine Architecture:

Push D
Push C
Push B
Mult
Add
Pop A

If the values in memory locations B, C, and D are 6, 2, and 4, respectively, what will be
stored in memory location A after the execution of the above program?
a. 14
b. 16
c. 26
d. 12

Push D: Pushes the value of D (4) onto the stack. Stack: [4]
Push C: Pushes the value of C (2) onto the stack. Stack: [4, 2]
Push B: Pushes the value of B (6) onto the stack. Stack: [4, 2, 6]
Mult: Multiplies the top two values on the stack (2 * 6) and pushes the result (12) onto the
stack. Stack: [4, 12]
Add: Adds the top two values on the stack (4 + 12) and pushes the result (16) onto the
stack. Stack: [16]
Pop A: Pops the top value (16) from the stack and stores it in memory location A.
After the execution of the program, the value 16 will be stored in memory location A.
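The trace above can be reproduced with a small Python model of the stack machine (the dict standing in for memory is illustrative):

```python
memory = {"B": 6, "C": 2, "D": 4}  # values given in the question
stack = []

stack.append(memory["D"])                # Push D -> [4]
stack.append(memory["C"])                # Push C -> [4, 2]
stack.append(memory["B"])                # Push B -> [4, 2, 6]
stack.append(stack.pop() * stack.pop())  # Mult   -> [4, 12]
stack.append(stack.pop() + stack.pop())  # Add    -> [16]
memory["A"] = stack.pop()                # Pop A

print(memory["A"])  # 16
```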

7. A program has 30% Load instructions, 20% Store instructions, 40% ALU instructions, and
10% Branch instructions. On a processor, each Load, Store, ALU, and Branch instruction
takes 4, 3, 1, and 2 cycles, respectively. What is the average CPI (Cycles Per Instruction) on
the processor for this program? [2 marks]

Given fractions and cycles per instruction class:
Load instructions = 30%, 4 cycles each
Store instructions = 20%, 3 cycles each
ALU instructions = 40%, 1 cycle each
Branch instructions = 10%, 2 cycles each

Average CPI = (Fraction of Loads * Cycles per Load) + (Fraction of Stores * Cycles per
Store) + (Fraction of ALU * Cycles per ALU) + (Fraction of Branches * Cycles per Branch)
Average CPI = (0.30 * 4) + (0.20 * 3) + (0.40 * 1) + (0.10 * 2)
Average CPI = 1.2 + 0.6 + 0.4 + 0.2 = 2.4
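The weighted sum above can be checked with a short Python sketch:

```python
# Instruction mix and per-class cycle counts from the question
mix = {"load": 0.30, "store": 0.20, "alu": 0.40, "branch": 0.10}
cycles = {"load": 4, "store": 3, "alu": 1, "branch": 2}

avg_cpi = sum(mix[k] * cycles[k] for k in mix)
print(round(avg_cpi, 2))  # 2.4
```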

8. A new Graphics Processing Unit (GPU) is added to a system, which speeds up the
execution of graphics-related instructions by 6 times. If a program has 50% graphics-related
instructions, what is the overall speedup gained while running the program on the system
with the GPU compared to running it on the system without the GPU? [2 marks]

Given:
Fraction enhanced = 50% = 0.5
Speedup of the enhanced fraction = 6

Overall speedup = 1 / [(1 - fraction enhanced) + (fraction enhanced / speedup)]
Overall speedup = 1 / [(1 - 0.5) + (0.5 / 6)]
= 1 / [0.5 + 0.0833]
= 1 / 0.5833
≈ 1.714
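Amdahl's law as applied above evaluates to exactly 12/7; a minimal Python check:

```python
fraction_enhanced = 0.5
speedup_enhanced = 6

# Amdahl's law: time not enhanced plus enhanced time divided by its speedup
overall = 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)
print(round(overall, 3))  # 1.714
```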
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 2: Detailed Solution
Indian Institute of Technology Guwahati

1. Which type of data hazard occurs when an instruction tries to read an operand before
another instruction writes it?
a. RAW hazard
b. WAR hazard
c. WAW hazard
d. RAR hazard

The type of data hazard that occurs when an instruction tries to read an operand before
another instruction writes it is called a RAW hazard, which stands for "Read After Write"
hazard. This situation arises when a dependent instruction (one that requires the result of a
previous instruction) attempts to read data that has not yet been written by the preceding
instruction.
2. Which one of the following is true regarding pipelining in microprocessors?
a. Pipelining reduces the latency of a single instruction.
b. Pipelining improves the throughput of a program.
c. Pipelining increases the clock speed of the processor.
d. Pipelining reduces the execution time of an instruction.

Pipelining allows multiple instructions to be overlapped in execution, which leads to higher
overall throughput by increasing the number of instructions completed in a given time
period. Pipelining does not reduce the latency of a single instruction (each instruction still
passes through every stage), nor does it directly increase the clock speed of the processor
or reduce the execution time of an individual instruction.
3. Which type of hazard occurs when different instructions, at different stages in the pipeline,
want to use the same hardware resource?
a. Data hazard
b. Control hazard
c. Structural hazard
d. Pipeline hazard

Structural hazards arise when there are not enough hardware resources (such as functional
units or memory ports) to accommodate the simultaneous execution of multiple
instructions in the pipeline. This can lead to delays and inefficiencies in the pipeline's
operation.
4. Which one of the following is a characteristic feature of a typical RISC machine?
a. Complex instructions with longer execution time.
b. Memory-intensive instructions.
c. Load and store instructions.
d. Instructions with variable lengths.

RISC architectures are designed to have a small and optimized set of instructions, with a
focus on simple and efficient operations. In RISC architectures, arithmetic and logic
operations are performed directly on registers, and memory is accessed only through
separate load and store instructions. This approach simplifies the instruction pipeline and
improves overall performance by reducing the complexity of individual instructions.
5. Which of the following statements is true with respect to the Instruction Fetch operation of
a processor pipeline?
a. Contents from Instruction memory are transferred to the ID/EX pipeline register
b. Contents from Instruction memory are transferred to the IF/ID pipeline register
c. Contents from the IF/ID pipeline register are transferred to the ID/EX pipeline register
d. Contents from the ID/EX pipeline register are transferred to the IF/ID pipeline register

During the Instruction Fetch stage of a pipeline, the next instruction is fetched from
memory and placed into the IF/ID pipeline register for further processing in subsequent
pipeline stages.

6. Consider 3 instructions I1, I2 and I3 given in the order of execution.
I1: ADD R1, R2, R3
I2: SUB R4, R1, R2
I3: XOR R2, R5, R3
The dependency between R2 of I1 and R2 of I3 is known as …………
a. input dependence
b. anti dependence
c. output dependence
d. true data dependence

The dependency between R2 of instruction I1 and R2 of instruction I3 is known as an anti
dependence (Write After Read). I1 reads R2 as a source operand, and I3 later writes R2 as
its destination. If I3's write to R2 were reordered before I1's read, I1 would read the wrong
value, so the earlier read must complete before the later write.

7. The technique of separating a dependent instruction from the source instruction by the
pipeline latency of the source instruction is called --------.
a. instruction folding
b. compiler scheduling
c. operand forwarding
d. bypassing

Compiler scheduling (static pipeline scheduling) rearranges instructions at compile time so
that a dependent instruction is placed at least the pipeline latency of its source instruction
away from that source. Independent instructions fill the intervening slots, so by the time
the dependent instruction needs its operand the result is already available and no stall is
required.
8. Assume an instruction pipeline with 5 stages, namely IF, ID, EX, MEM and WB, with
individual latencies 50 ns, 30 ns, 70 ns, 85 ns, and 40 ns, respectively. Pipeline latch
latency is 10 ns. What is the pipeline cycle time in ns?

Answer: 95 ns
Pipeline cycle time = Longest Stage Latency + Pipeline Latch Latency
Pipeline cycle time = 85 ns + 10 ns = 95 ns
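The slowest stage determines the clock; a minimal Python sketch of the calculation:

```python
# Stage latencies in ns from the question
stage_latency = {"IF": 50, "ID": 30, "EX": 70, "MEM": 85, "WB": 40}
latch_latency = 10  # ns

# The cycle must accommodate the slowest stage plus the latch overhead
cycle_time = max(stage_latency.values()) + latch_latency
print(cycle_time)  # 95
```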

9. Given a non-pipelined architecture running at 1.5 GHz that takes 5 cycles to finish an
instruction, you want to make it pipelined with 5 stages. Due to hardware overhead, the
pipelined design will operate only at 1 GHz. 10% of memory instructions cause a stall of 30
cycles, 30% of branch instructions cause a stall of 2 cycles, and load-ALU combinations
cause a stall of 1 cycle. Assume that in a given program, there exist 20% branch
instructions and 30% memory instructions. 10% of instructions are load-ALU
combinations. What is the speedup of the pipelined design over the non-pipelined design?
Correct to 2 decimal places. [2 marks]
Answer: Range: 1.55 to 1.58

In the non-pipelined architecture, the CPI is 5.
Execution time for 1 instruction: 5 * (1/1.5) ns = 3.33 ns
In the pipelined architecture, the CPI is calculated as follows:
CPI_p = Base CPI + Stall CPI
= 1 + (0.3 * 0.1 * 30) + (0.2 * 0.3 * 2) + (0.1 * 1)
= 1 + 0.9 + 0.12 + 0.1 = 2.12
Execution time for 1 instruction: 2.12 * 1 ns = 2.12 ns
The speedup is calculated as: Speedup = ET_non-pipelined / ET_pipelined
= 3.33 / 2.12 ≈ 1.57
Allowed Range: 1.55 to 1.58
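The speedup calculation can be sketched in Python:

```python
# Non-pipelined: CPI 5 at 1.5 GHz
et_nonpipelined = 5 * (1 / 1.5)  # ns per instruction

# Pipelined: base CPI 1 plus the three stall contributions, at 1 GHz
stall_cpi = (0.30 * 0.10 * 30) + (0.20 * 0.30 * 2) + (0.10 * 1)
et_pipelined = (1 + stall_cpi) * (1 / 1.0)  # ns per instruction

speedup = et_nonpipelined / et_pipelined
print(round(speedup, 2))  # 1.57
```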
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 3: Detailed Solution
Indian Institute of Technology Guwahati

1. Which one of the following statements is/are TRUE?
I. A one-bit predictor changes the prediction value for each mis-prediction.
II. The BPB stores the previous outcomes of the branch instruction.
III. A (p,q) branch predictor uses the outcome of the last p branches to index into the BPB,
where each entry has a q-bit predictor.
IV. If the BTB can store one or more target instructions, it can facilitate branch folding.
a. III only
b. I and II only
c. IV only
d. I, II, III and IV

I. True - A one-bit predictor flips its single prediction bit on every mis-prediction.
II. True - The BPB (Branch Prediction Buffer) stores previous branch outcomes.
III. True - In a (p,q) branch predictor, the outcomes of the last p branches index into the
BPB, whose entries are q-bit predictors.
IV. True - If the BTB (Branch Target Buffer) stores the target instructions themselves, the
branch can be replaced by its target instruction in the fetch stream, which is branch folding.


2. With respect to a MIPS multi-cycle floating point pipeline, which one of the following
statements is FALSE?
a. RAW dependency stalls can happen even after enabling operand forwarding.
b. Even after operand forwarding, there will be 3 stalls between a pair of adjacent
FADD instructions that have a RAW dependency between them.
c. The Initiation Interval of the FMUL unit is 1 cycle.
d. Even after operand forwarding, there will be 7 stalls between a pair of adjacent
FMUL instructions that have a RAW dependency between them.

In the MIPS multi-cycle floating-point pipeline, the FADD unit has a latency of 3, so even
with operand forwarding a dependent FADD issued immediately after its producer must
stall for 3 cycles, which makes statement (b) true. The FMUL unit has a latency of 6 (see
Question 6 below), so a pair of adjacent FMUL instructions with a RAW dependency
incurs 6 stalls, not 7; statement (d) is therefore FALSE. Statements (a) and (c) are true:
forwarding cannot completely hide multi-cycle latencies, and the fully pipelined FMUL
unit has an initiation interval of 1 cycle.

3. For filling the delay slot of a branch, an instruction is chosen from the target location of the
branch if ..........
a. the outcome of the branch is irrelevant
b. the probability of branch not taken is very high
c. the probability of branch taken and not taken is the same
d. the probability of branch taken is very high

Choosing an instruction from the target location of the branch is most effective when the
branch is highly likely to be taken. This optimizes pipeline performance by executing an
instruction from the taken branch's target, improving overall instruction throughput.

4. A Branch Prediction Buffer with 64 rows is indexed by
a. the outcome of the last 16 branches
b. the outcome of the last 8 branches
c. the lower-order 6 bits of the address of the branch instruction
d. 64 bits of the physical address of the branch instruction

The lower-order bits of the branch instruction's address are used as an index to select the
appropriate row in the Branch Prediction Buffer; 6 bits are enough to address 64 rows.
This index determines which prediction entry corresponds to the specific branch
instruction.


5. Which of the following best describes a (p, q) type branch predictor?
a. It uses the outcome of the last p branches to index into the BPB, where each entry has a
q-bit predictor.
b. It uses the outcome of the last 2p branches to index into the BPB, where each entry has a
q-bit predictor.
c. It uses the outcome of the last p branches and the last q bits of the PC to index into the
BPB to decide the predictor.
d. It uses the outcome of the last q branches and the last p bits of the PC to index into the
BPB to decide the predictor.

A (p, q) type branch predictor uses the history of the outcomes of the last p branches to
index into a Branch Prediction Buffer (BPB). Each entry in the BPB is a q-bit predictor,
which determines the predicted outcome of the current branch instruction.
6. What is the latency of the floating-point Multiplier Unit in a MIPS processor?
a. 7
b. 6
c. 4
d. 1

The floating-point Multiplier Unit in a MIPS processor has a latency of 6 cycles: it takes 6
clock cycles for the multiplier unit to produce the final result after receiving the input
operands.
7. Which one of the following branch handling approaches allows a branch to take place after
one instruction following the branch instruction?
a. Stall until branch direction is clear
b. Predict Branch Taken
c. Predict Branch Not Taken
d. Delayed Branch

In a delayed branch approach, the instruction placed immediately after the branch (in the
branch delay slot) is always executed before the branch takes effect. This lets the processor
do useful work while the branch decision is being resolved, minimizing the impact of
branch delays.

8. Among the listed operations, which one does not have a fully pipelined implementation in a
MIPS processor?
a. Floating Point Add
b. Floating Point Subtract
c. Floating Point Multiply
d. Floating Point Divide

Floating Point Divide is a complex operation that involves multiple stages and
dependencies, making it difficult to achieve full pipelining due to its variable latency and
inter-stage dependencies. As a result, Floating Point Divide operations have longer
latencies and are not as amenable to pipelining as the other operations listed.
9. Consider a (2,2) type branch predictor. The BHT is indexed by the outcome of the last 2
branches. The BPB is initialized for NN/NT/TN/TT as 00/00/11/11 and is indexed with the
NN entry on the first reference. Consider the last 6 actual outcomes of a single static
branch, {oldest N N T T T N latest}, where T means the branch is taken and N means not
taken. What will be the contents of the BPB after the execution of the above mentioned 6
branch outcomes? [2 marks]
(A) 01/01/11/00
(B) 01/01/00/11
(C) 01/01/11/10
(D) 01/01/10/11

S.No. | Last 2 Outcomes | BPB (NN/NT/TN/TT) | Prediction | Outcome | Misprediction?
1.    | NN (initial)    | 00/00/11/11       | N          | N       | No
2.    | NN              | 00/00/11/11       | N          | N       | No
3.    | NN              | 00/00/11/11       | N          | T       | Yes
4.    | NT              | 01/00/11/11       | N          | T       | Yes
5.    | TT              | 01/01/11/11       | T          | T       | No
6.    | TT              | 01/01/11/11       | T          | N       | Yes
Final | TN              | 01/01/11/10       |            |         |

CORRECT ANSWER: (C) 01/01/11/10

In each row, the entry indexed by the Last 2 Outcomes column is the one consulted; its
state transition after the actual outcome appears in the same position in the next row.
Therefore, at the end of the 6 branch outcomes, the BPB contents are 01/01/11/10.
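The table can be checked with a short simulation of the four 2-bit saturating counters (00/01 predict not taken, 10/11 predict taken; a taken outcome increments the consulted counter, a not-taken outcome decrements it):

```python
# Four 2-bit saturating counters, indexed by the last two branch outcomes.
bpb = {"NN": 0b00, "NT": 0b00, "TN": 0b11, "TT": 0b11}
history = "NN"  # the first reference indexes the NN entry

for outcome in "NNTTTN":  # oldest to latest
    entry = history
    if outcome == "T":
        bpb[entry] = min(bpb[entry] + 1, 3)  # saturate upward on taken
    else:
        bpb[entry] = max(bpb[entry] - 1, 0)  # saturate downward on not taken
    history = history[1] + outcome           # shift the outcome into the history

print("/".join(format(bpb[k], "02b") for k in ("NN", "NT", "TN", "TT")))
# 01/01/11/10
```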
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 4: Detailed Solution
Indian Institute of Technology Guwahati
1. Loop unrolling results in ___
a. increasing register pressure
b. decreasing I-cache misses
c. increasing the number of control hazards
d. increasing control dependence

Loop unrolling increases register pressure because more variables are processed
simultaneously, potentially requiring more registers to store intermediate values. This can
lead to register spills, where values are stored in memory due to a shortage of available
registers.

2. In Tomasulo's algorithm, register renaming is done using ___.
a. Compiler scheduling
b. Reservation Station
c. Reorder Buffer
d. Common Data Bus

In Tomasulo's algorithm, register renaming is accomplished using Reservation Stations.
Reservation stations are buffers associated with functional units that hold instruction
operands until they are ready to execute. This allows instructions to execute out of order,
boosts functional unit usage, and resolves data dependencies by holding the latest values
regardless of instruction order, enhancing performance in dynamic instruction scheduling
and out-of-order execution.
3. Which one of the following statements is FALSE?
a. A compiler can achieve instruction level parallelism using strip mining.
b. A normal in-order multi-cycle MIPS pipeline can never achieve an IPC larger than 1.
c. In a MIPS multi-cycle floating point pipeline that supports operand forwarding,
there will be six stalls between a pair of adjacent MUL instructions that have a
RAW dependency between them.
d. WAW and WAR are true data dependencies.

WAW and WAR are name dependencies, not true data dependencies: they can be resolved
through register renaming or proper instruction scheduling and do not reflect an actual flow
of data between instructions. The only true data dependency is RAW (Read After Write),
where an instruction depends on the result of a previous instruction.
4. Register renaming can solve _____.
a. WAR and WAW hazards
b. RAW hazard only
c. WAR hazard only
d. RAW, WAR, and WAW hazards

Register renaming can solve WAR (Write After Read) and WAW (Write After Write)
hazards. The technique assigns temporary names to architectural registers, allowing
instructions to use the same register names without causing name conflicts. It cannot
remove RAW hazards, which represent a true flow of data from one instruction to the next.

5. Which of the following is the best match from Set A to Set B?

Set A                                    Set B
W. Static Scheduling                     1. Reorder buffer
X. Operand Forwarding                    2. Loop unrolling
Y. Speculative Dynamic Scheduling        3. RSI Update
Z. CDB writing                           4. Reservation station

a. W→3 X→1 Y→4 Z→2
b. W→2 X→4 Y→1 Z→3
c. W→2 X→4 Y→3 Z→1
d. W→3 X→2 Y→1 Z→4

W → 2 (Static Scheduling → Loop unrolling): Static scheduling determines instruction
order at compile time, and loop unrolling is a compile-time transformation that exposes
parallelism in loops for such scheduling.
X → 4 (Operand Forwarding → Reservation station): Reservation stations capture results
broadcast on the CDB and forward them directly to waiting instructions.
Y → 1 (Speculative Dynamic Scheduling → Reorder buffer): Speculative dynamic
scheduling executes instructions speculatively and relies on a reorder buffer for in-order
commit.
Z → 3 (CDB writing → RSI Update): A write on the Common Data Bus (CDB) updates
the Register Status Indicator (RSI) entries that are waiting on the broadcast tag.
6. In a dynamically scheduled processor that supports speculation, if the register status
indicator of a register Rx is 0, then _____.
a. the latest value of Rx can be obtained from entry #0 in the reorder buffer.
b. the latest value of Rx will be produced by functional unit #0.
c. the latest value of Rx can be obtained from entry #0 in the reservation station.
d. the latest value of Rx is available in the Register File.

If the register status indicator of register Rx is 0, no in-flight instruction will write Rx: the
latest value of Rx is already available in the Register File and can be read directly, without
waiting on any reorder buffer or reservation station entry.

7. Consider an ADD instruction with first operand Rx and second operand Ry that is to be
executed in a dynamically scheduled processor that follows Tomasulo's algorithm. When
the instruction is issued, the seven-tuple entry {Op, Qj, Qk, Vj, Vk, A, Busy} in the
reservation station for this instruction is {ADD, 2, 0, 0, 2, A, 1}. Which of the following is
correct about the operands of this ADD instruction?
a. Rx value is available from the output of functional unit #2, and Ry value is available
from the output of functional unit #0.
b. Rx value is 2 and Ry value is available from the output of functional unit #0.
c. Rx value is available from the output of functional unit #2, and Ry value is 2.
d. Rx value is 2 and Ry value is 0.

Qj = 2 means the first operand (Rx) is still being produced and will come from the output
of functional unit #2. Qk = 0 means the second operand is already available, so its value is
taken from Vk, which is 2. Hence Rx comes from functional unit #2 and Ry's value is 2.
8. Suppose a load and a store access the same address. If in program order the store appears
before the load, interchanging them in execution order can create a ____ hazard.
a. RAW
b. WAR
c. WAW
d. No

A RAW hazard occurs when a read depends on the result of a preceding write. If the load
is executed before the store it follows in program order, it will read the old value from
memory instead of the updated value written by the store.
9. Consider an instruction pipeline with an issue width of 1 that uses Tomasulo's algorithm
with one reservation station per functional unit. There is one Integer MUL unit, one Integer
DIV unit, and one Integer ADD unit, all connected to a single CDB. The functional units
are not pipelined. An instruction waiting for data on the CDB can move to its EX stage in
the cycle after the CDB broadcast. The instructions are:
I1: ADDI R1, R1, #8
I2: DIV R3, R2, R1
I3: MUL R4, R1, R3
I4: DIV R5, R4, R1

Functional unit type | Cycles in EX stage
Integer MUL unit     | 4
Integer DIV unit     | 8
Integer ADD unit     | 1

In which cycle does instruction I3 write to CDB? [2 marks]

Answer: 17
I1 issues in cycle 1, spends cycle 2 in EX, and broadcasts R1 on the CDB in cycle 3.
I2 issues in cycle 2, receives R1 from the CDB in cycle 3, executes in cycles 4-11, and
broadcasts R3 in cycle 12.
I3 issues in cycle 3, receives R3 in cycle 12, executes in cycles 13-16, and writes to the
CDB in cycle 17.
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 5: Detailed Solution
Indian Institute of Technology Guwahati

1. In a typical GPU kernel execution, which of the following statements is/are FALSE?
a. Threads of the same block can share data.
b. Data transfer from device to host memory happens after the GPU kernel executes.
c. GPU threads can access contents of host memory directly.
d. Data transfer from host to device memory happens before the GPU kernel executes.

a. Threads of the same block can share data. - True.
Threads within the same block can communicate and share data using shared memory.
b. Data transfer from device to host memory happens after the GPU kernel executes. - True.
Data is typically transferred from the GPU device memory to the host memory after the
GPU kernel execution is complete, to retrieve results.
c. GPU threads can access contents of host memory directly. - False.
GPU threads cannot directly access host memory; data must be transferred to GPU
memory first.
d. Data transfer from host to device memory happens before the GPU kernel executes. - True.
Data is transferred from host to GPU device memory before the GPU kernel execution
begins, to provide the necessary input for computation.


2. Which one of the following statements is TRUE?
a. Switching between instruction streams is more frequent in coarse-grained
multithreading than in fine-grained multithreading.
b. Hyper-threading issues instructions from more than one instruction stream per slot.
c. Multithreading has better resource utilization than hyper-threading.
d. Multithreading can give better throughput than hyper-threading.

Hyper-Threading (simultaneous multithreading) allows a single CPU core to issue
instructions from more than one instruction stream in the same issue slot. This improves
resource utilization and can enhance performance in multi-threaded scenarios.
3. Which one of the following is FALSE with respect to a superscalar processor?
a. CPI will ideally be less than 1.
b. It can support multiple instruction issues per clock cycle.
c. There will be multiple functional units, but only one of them can be busy at any given
point in time.
d. There is operational support for fetching more than one instruction per clock cycle.

In a superscalar processor, it is not true that only one functional unit can be busy at any
given time. Superscalar processors are designed with multiple functional units that work
simultaneously to execute multiple instructions in parallel.
ch

4. Which one of the following processors executes instruction bundles created by a compiler that exploited parallelism in code?
a. Scalar processors
b. VLIW processors
c. Speculative processors
d. SIMD processors
VLIW (Very Long Instruction Word) processors execute instruction bundles that have been explicitly scheduled by the compiler to take advantage of parallelism in the code. The compiler arranges multiple independent instructions into a single VLIW instruction bundle, allowing the processor to execute them simultaneously without runtime dependencies. This approach offloads the responsibility of instruction-level parallelism to the compiler rather than relying on complex hardware mechanisms like in superscalar processors.
5. Which execution model is used in a GPU, where each thread executes the same code but on different data elements?
e. SIMD
f. SISD
g. MIMD
h. MISD
In a SIMD (Single Instruction, Multiple Data) execution model, a single instruction is executed simultaneously by multiple threads (or processing elements) on different data elements. This approach is well-suited for tasks that involve applying the same operation to multiple data items in parallel, which is a common scenario in graphics processing on GPUs.
6. In a GPU, which one of the following statements is TRUE with respect to memory coalescing?
a. Maximum throughput happens when threads in adjacent warps access the same cache line at a time.
b. Maximum throughput happens when threads in a warp access the same cache line at a time.
c. Maximum throughput happens when all threads in a warp access adjacent rows in memory at a time.
d. Maximum throughput happens when threads in a warp access adjacent cache lines at a time.
Memory coalescing in a GPU aims to minimize memory access latency and maximize memory throughput by ensuring that threads in a warp access memory locations in a contiguous and aligned manner. Accessing the same cache line at a time by threads in a warp allows for efficient memory transactions and an improved data transfer rate.

7. Consider a 1600x1000 HD display with a refresh rate 50 frames/second. It takes 50 instructions to process a pixel. A processor at 1 GHz and average IPC=1 is used to process the display. What is the minimum number of such processors required to ensure quality display streaming? [2 marks]
Ans: 4
Display resolution: 1600x1000 pixels
Refresh rate: 50 frames/second
Instructions to process a pixel: 50
Processor frequency: 1 GHz
Average IPC (Instructions Per Clock): 1
Total instructions per frame = Number of pixels × Instructions per pixel = 1600 × 1000 × 50 = 80,000,000
Total instructions per second = Total instructions per frame × Refresh rate = 80,000,000 × 50 = 4,000,000,000 instructions/second
The processor's clock speed is 1 GHz, which means it can execute 1 billion instructions per second (1 IPC × 1 GHz).
Minimum number of processors = Total instructions per second required / Processor's instruction processing rate = 4,000,000,000 / 1,000,000,000 = 4
Hence, the correct answer is 4 processors to ensure quality display streaming.
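The arithmetic can be cross-checked with a short calculation (a sketch; the variable names are illustrative, not part of the question):

```python
import math

# Minimum processors for the display workload above.
pixels = 1600 * 1000                 # display resolution
instr_per_pixel = 50
fps = 50                             # refresh rate (frames/second)
per_proc_rate = 1 * 10**9            # 1 GHz x IPC of 1 = instructions/second

required = pixels * instr_per_pixel * fps     # instructions/second for the display
print(math.ceil(required / per_proc_rate))    # → 4
```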


8. Given an image A represented as a 12x12 pixel matrix. An operation is done on A by a GPU that is using 2D blocks having 4 threads per block. Consider a pixel P whose blockIdx.x=3, blockIdx.y=2, threadIdx.x=0, and threadIdx.y=1. If the image A is stored in row major format in memory from location A[0] to A[143], what is the index of P in the array?
Ans: 66
blockDim.x = blockDim.y = 2 (2*2 = 4, as there are 4 threads per block.)
Row = blockIdx.y * blockDim.y + threadIdx.y = 2*2+1 = 4+1 = 5
Col = blockIdx.x * blockDim.x + threadIdx.x = 3*2+0 = 6
Therefore, Index = row*dim+col = 5*12+6 = 60+6 = 66
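The same index arithmetic can be sketched in plain Python (the CUDA built-ins blockIdx, threadIdx and blockDim are modelled here as ordinary parameters, an assumption made for illustration):

```python
# (bx, by): block indices; (tx, ty): thread indices within the block
def pixel_index(bx, by, tx, ty, block_dim=2, width=12):
    row = by * block_dim + ty        # blockIdx.y * blockDim.y + threadIdx.y
    col = bx * block_dim + tx        # blockIdx.x * blockDim.x + threadIdx.x
    return row * width + col         # row-major flattening

print(pixel_index(bx=3, by=2, tx=0, ty=1))   # → 66
```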
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 6: Detailed Solution
Indian Institute of Technology Guwahati

1. Which one of the following statements is TRUE with respect to an m-way set-associative cache memory organization?
a. Every cache block in a set will have a tag field.
b. There is only one tag field for each set.
c. Tag comparison happens sequentially from way 0 to way m-1.
d. Cache hit time is dependent on the way in which tag matching happens.
In an m-way set-associative cache, each set consists of m cache blocks, and each block in the set will have a corresponding tag field. This allows the cache to store multiple blocks in each set and perform parallel tag comparisons when looking up data in the cache. The tag fields are used to determine whether the requested data is present in the cache (cache hit) or not (cache miss).
2. The word length of the processor is 16 bits. The address of the first byte of a word in a byte addressable 1 MB physical memory is 0xAB8F2. This word upon bringing to the cache is mapped to set 30. How many words can be accommodated in each cache block?
a. 4
b. 8
c. 16
d. 32
30 in binary is 11110.
0xAB8F2 = 1010 1011 1000 1111 0010
With a 3-bit byte offset, the 5 bits just above the offset are 11110 = 30, matching the given set number. So the byte offset is 3 bits and the block size is 2^3 = 8 bytes.
Since each word is 16 bits and memory is byte addressable (8 bits), one word = 16/8 = 2 bytes.
Therefore, number of words that can be accommodated in each cache block = 2^3/2 = 2^2 = 4.

3. Consider a system with 8 KB direct mapped data cache with a block size of 64 bytes. The system has a physical address space of 64 KB with a word length of 16 bits. How many bits are required to represent the tag field in a cache block?
a. 7 bits
b. 5 bits
c. 6 bits
d. 3 bits
# of sets = cache size / (block size x # of ways/set) = 2^13 / (2^6 x 1) = 2^7 sets.
# bits representing set index = 7 bits
64 KB physical address space → 16-bit physical address
Tag = 3 bits, Index = 7 bits, Offset = 6 bits
4. Which one of the following statements is TRUE for a write miss in no write allocate caches?
a. The block containing the missed word is brought to the cache for writing and will retain it there till it is evicted out.
b. Write the missed word in the main memory only.
c. Write the missed word in the main memory and then immediately bring the modified block to the cache.
d. The block containing the missed word is brought to the cache for writing and then immediately writes back the block to the main memory.
In a no-write-allocate cache policy, when a write miss occurs, the cache does not allocate a block for the write operation. Instead, it directly writes the data to the main memory (as in write-through) without bringing the entire block into the cache. This approach avoids unnecessary cache block allocations for write operations and ensures that the main memory always contains the most up-to-date data.
5. When a processor requests data from memory, the cache is checked first. Upon encountering a miss, the cache is loaded first from memory and then the processor is loaded from cache. This type of cache is called _____.
e. Look aside cache
f. Look through cache
g. Look inside cache
h. Look long cache
The type of cache in which the cache is checked first for requested data, and upon a cache miss, the cache is loaded first from memory and then the processor is loaded from the cache, is indeed called a "Look-through Cache."

6. How many conflict misses are encountered when FIFO cache block replacement technique is used with a 4-way set associative cache for the following block access pattern? Assume initially the cache is empty.
P, Q, R, S, T, P, Q, S, R, T, Q, P
a. 1
b. 3
c. 5
d. 0

Block   P  Q  R  S  T  P  Q  S  R  T  Q  P
Way-1   P  P  P  P  T  T  T  T  T  T  T  T
Way-2   -  Q  Q  Q  Q  P  P  P  P  P  P  P
Way-3   -  -  R  R  R  R  Q  Q  Q  Q  Q  Q
Way-4   -  -  -  S  S  S  S  S  R  R  R  R

Therefore, the total misses are 8 (including 5 compulsory misses and 3 conflict misses).
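The table can be reproduced with a small FIFO model of one 4-way set (a sketch; the function name and structure are illustrative):

```python
from collections import deque

# FIFO replacement in one 4-way set; counts compulsory vs. conflict misses.
def fifo_set(accesses, ways=4):
    cache = deque()                 # oldest resident block at the left
    seen = set()
    compulsory = conflict = 0
    for blk in accesses:
        if blk in cache:
            continue                # hit: FIFO order is unaffected by hits
        if blk in seen:
            conflict += 1           # referenced before but evicted: conflict miss
        else:
            compulsory += 1         # first-ever reference: compulsory miss
            seen.add(blk)
        if len(cache) == ways:
            cache.popleft()         # evict the oldest resident (FIFO)
        cache.append(blk)
    return compulsory, conflict

print(fifo_set("PQRSTPQSRTQP"))     # → (5, 3)
```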
7. A program is stored in a 16 MB main memory that is attached to a 4 KB direct mapped D-cache with a block size of 16 bytes. The program reads 4 data words A, B, C and D in that order 5 times (total 20 memory references). Let the physical addresses of A, B, C and D be 0x420424, 0x74042A, 0x740664, 0x74066D, respectively. Assume the caches are empty initially and one word is 2 bytes. Which of the following statements is/are FALSE?
a. Out of the 20 memory references, 9 of them are cache hits.
b. Every access to D will be a hit.
c. At the end of 20 memory references, A, C and D are located inside the cache.
d. Every access to C will result in eviction of B from the cache.
# of sets = cache size / (block size x # of ways/set) = 2^12 / (2^4 x 1) = 2^8 sets.
16 MB main memory → 24-bit physical address
Tag = 12 bits, Index = 8 bits, Offset = 4 bits


Mapping:
A → 0x420424  Tag=0x420, Set Index: 0x42
B → 0x74042A  Tag=0x740, Set Index: 0x42
C → 0x740664  Tag=0x740, Set Index: 0x66
D → 0x74066D  Tag=0x740, Set Index: 0x66

A and B map to the same set, but they are part of different blocks (the tag is different). Since the cache is direct mapped, A and B will have conflict misses between them as one will evict the other. C and D have the same tag and set index, so they are part of the same block and will exist together in the same block without any conflict. A, B and C will have compulsory misses; D will not, as bringing C will automatically bring D also. Hence access to D will always be a hit. Each access to A and B will be a miss and the first access to C will be a miss, so a total of 5+5+1 = 11 misses (and 20-11 = 9 hits, making statement (a) true). At the end, B, C and D will be in cache. Hence statements (c) and (d) are FALSE.
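The tag/set-index split can be reproduced with a few bit operations (a sketch; the field widths follow the 24-bit address layout derived above, and the helper name is illustrative):

```python
# Offset = 4 bits (16 B blocks), index = 8 bits (2^8 sets), tag = remaining 12 bits.
def split(addr, offset_bits=4, index_bits=8):
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return hex(tag), hex(index)

for name, addr in [("A", 0x420424), ("B", 0x74042A),
                   ("C", 0x740664), ("D", 0x74066D)]:
    print(name, split(addr))
# A ('0x420', '0x42')   B ('0x740', '0x42')
# C ('0x740', '0x66')   D ('0x740', '0x66')
```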

8. The following 13 memory block requests A, B, D, A, B, C, E, A, B, E, D, C & D are mapped to set n of a 4-way set-associative cache memory that uses Practical Pseudo LRU block replacement technique. Assume that set n is initially empty. What will be the contents of set n (in the order way0, way1, way2 and way3) after servicing all the requests? [Assume that data is entered into an empty cache block in way-0, way-1, way-2 & way-3 order]
a. ECDA
b. ECAD
c. EDCA
d. BEDC
Total access = 13
Total miss = 5 compulsory misses + 3 conflict misses = 8 misses
Hits = 5
Contents of the set will be in the order ECAD from way-0 to way-3.
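The hand simulation can be checked with a small model of tree-based (practical) pseudo-LRU for one 4-way set. This is a sketch that assumes the common convention: three tree bits (root, left pair, right pair), the victim is found by following the bits (0 = left, 1 = right), and an access sets the bits on its path to point away from the accessed way.

```python
def plru_simulate(accesses):
    ways = [None] * 4        # contents of way0..way3
    bits = [0, 0, 0]         # [root, {way0,way1} bit, {way2,way3} bit]

    def touch(w):            # point the path bits away from way w
        bits[0] = 1 if w < 2 else 0
        if w < 2:
            bits[1] = 1 if w == 0 else 0
        else:
            bits[2] = 1 if w == 2 else 0

    misses = hits = 0
    for blk in accesses:
        if blk in ways:                     # hit: just update the tree bits
            hits += 1
            touch(ways.index(blk))
        else:                               # miss: fill empty ways in order,
            misses += 1                     # else follow the bits to the victim
            if None in ways:
                w = ways.index(None)
            elif bits[0] == 0:
                w = 0 if bits[1] == 0 else 1
            else:
                w = 2 if bits[2] == 0 else 3
            ways[w] = blk
            touch(w)
    return ways, misses, hits

print(plru_simulate("ABDABCEABEDCD"))
# → (['E', 'C', 'A', 'D'], 8, 5)
```

With this convention the model reproduces the answer above: contents ECAD, 8 misses, 5 hits.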

NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 7: Detailed Solution
Indian Institute of Technology Guwahati

1. For a cache memory of given capacity, as block size increases, there is
a. an increase in compulsory misses and a decrease in conflict misses.
b. a decrease in compulsory misses and conflict misses.
c. an increase in compulsory misses and conflict misses.
d. a decrease in compulsory misses and increase in conflict misses.
Increasing block size generally leads to:
A decrease in compulsory misses: larger blocks fetch more data, reducing the frequency of fetching new blocks from main memory.
An increase in conflict misses: for a fixed capacity, larger blocks mean fewer blocks in the cache, increasing the likelihood of multiple memory addresses mapping to the same cache set, causing cache conflicts and evictions.
2. Which one of the following optimizations reduces the cache miss penalty?
a. Pipelined caching
b. Multi-level caching
c. Way prediction
d. Multibanked caching
In a multi-level caching system, frequently accessed data is more likely to be found in a smaller, faster cache (e.g., L1 cache), reducing the cache miss penalty compared to accessing data directly from main memory. As you move down the cache hierarchy, the cache sizes typically increase, providing a balance between low-latency access for frequently used data and larger capacity for less frequently used data.

3. Which one of the following statements is FALSE?
a. Victim cache is added to a cache to hold recently evicted cache lines.
b. Early restart and critical word first techniques reduce miss penalty.
c. Hardware prefetching reduces cache hit time.
d. Non-blocking cache results in increased cache bandwidth.
Hardware prefetching is a technique used to anticipate future memory accesses and fetch the required data into the cache before it's actually needed. While prefetching can help in reducing cache miss penalties by ensuring that the data is already in the cache when needed, it doesn't directly reduce the cache hit time. Prefetching aims to mitigate cache miss penalties rather than accelerating cache hit times.
4. Which one of the following statements is TRUE?
a. Way prediction technique reduces miss penalty in caches.
b. The conflict miss rate is low in a direct-mapped cache compared to a set associative cache of similar cache configuration.
c. Direct-mapped caches can have associativity larger than one.
d. Pipelined caches help in faster clocking rate for cache.
(a) Way prediction predicts which cache way will contain the desired data; it reduces hit time, not miss penalty. - FALSE
(b) The conflict miss rate is higher, not lower, in a direct-mapped cache, since a set associative cache can accommodate several blocks that map to the same set. - FALSE
(c) Direct-mapped caches have associativity of exactly one. - FALSE
(d) Pipelined cache designs can optimize cache access and data retrieval processes, potentially allowing for a faster clocking rate and higher data throughput. - TRUE

5. The average memory access time for a memory hierarchy system with one level of cache and a main memory is 6 ns. The hit time and miss penalty of the cache is 2 ns and 100 ns, respectively. The hit rate of the cache (round off to two decimal places) is
e. 0.94
f. 0.96
g. 0.02
h. 0.04
AMAT = 6 ns
Hit Time = 2 ns
Miss Penalty = 100 ns
6 ns = 2 ns + (Miss Rate * 100 ns)
Miss Rate * 100 ns = 6 ns - 2 ns = 4 ns
Miss Rate = 0.04
Since hit rate + miss rate = 1:
Hit Rate = 1 - 0.04 = 0.96

6. Assume an L1 cache with a hit rate of 85%, and an L2 cache with a local miss rate of 4%. If there are 1500 memory accesses initiated by the CPU, then the number of memory accesses that will find a hit in L2 cache is _____.
Ans: 216
Number of accesses that miss in L1 cache = (Miss rate_L1 * # memory accesses to L1) = 0.15 * 1500 = 225
Hence there are 225 memory accesses to L2 cache.
Number of misses in L2 cache = (Miss rate_L2 * # memory accesses to L2) = 0.04 * 225 = 9
Number of memory accesses that will hit in L2 = 225 - 9 = 216


7. A cache has a hit time of 10 ns and hit rate of 60%. An optimization was made to increase the hit rate to 70%, but the hit time was increased to 15 ns. The optimization resulted in a 10% reduction in average memory access time. Assume that the miss penalty is unaffected by the optimization. The miss penalty of the cache (in ns) is _____.
Ans: 100 [range 100 to 100]
Hit time_old = 10 ns
Hit rate_old = 0.6
Hit time_opt = 15 ns
Hit rate_opt = 0.7
AMAT = Hit time + Miss rate * Miss penalty
AMAT_opt = 0.9 * AMAT_old, i.e. AMAT_opt / AMAT_old = 0.9
0.9 = (15 + 0.3x) / (10 + 0.4x)
x = 100
Miss penalty = 100 ns
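The equation can be solved mechanically (a sketch using exact rational arithmetic to avoid floating-point rounding):

```python
from fractions import Fraction

# 0.9 * AMAT_old = AMAT_opt  →  0.9 * (10 + 0.4x) = 15 + 0.3x
# 9 + 0.36x = 15 + 0.3x  →  0.06x = 6  →  x = 100
nine_tenths = Fraction(9, 10)
x = (15 - nine_tenths * 10) / (nine_tenths * Fraction(4, 10) - Fraction(3, 10))
print(x)   # → 100
```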

8. A 32-bit word processor is connected to a 16 KB, 4-way set-associative L1 cache having a block size of 32 B. Total physical address space is 256 MB. When an L1 cache miss occurs, it takes 25 cycles to fetch the first word of a block from L2 cache and 4 cycles for each subsequent word in the block. Assume that the processor is stalled due to an L1 cache miss that occurred on a word whose first byte address is 0x3416ACC. Assume that the word is a hit in L2 cache. How many cycles will the processor stall before it resumes execution if an early restart optimization is done on L1 cache?
Ans: 37 [range 37 to 37]
Main memory = 256 MB = 2^28 bytes → 28-bit physical address
L1 cache: # sets = CS/(BS * Asso) = 2^14/(2^5 * 2^2) = 2^7 → set index = 7 bits
1 word = 4 B
Block size = 32 B → # words/block = 32/4 = 8 → 3 bits for the word within a block
Tag (16) | Set (7) | Block offset (5 = 3 + 2)
L1 cache miss @ 0x3416ACC = 0011 0100 0001 0110 1010 1100 1100
Word-within-block bits = 011 → word 3 (the 4th word of the block)
L1 cache miss: 25 cycles (first word) + 4 cycles (each subsequent word)
Early restart = 25 + 3x4 = 37 cycles
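The word position and stall count can be sketched as follows (variable names are illustrative):

```python
# With early restart the processor resumes as soon as the requested word
# arrives; words stream in block order at 25 cycles (first) + 4 cycles each.
addr = 0x3416ACC
word_bytes = 4                           # 32-bit words
words_per_block = 32 // word_bytes       # 32 B block → 8 words
word_in_block = (addr // word_bytes) % words_per_block
stall = 25 + word_in_block * 4
print(word_in_block, stall)              # → 3 37
```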
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 8: Detailed Solution
Indian Institute of Technology Guwahati

1. Which one of the following statements is FALSE?
a. Write propagation ensures that a write is eventually seen by all the threads.
b. Memory consistency provides local ordering of accesses to all words in a cache block.
c. Cache coherence provides local ordering of accesses to each cache block.
d. Write serialization ensures that writes to the same location are globally ordered.
Statement (b) is false. Memory consistency primarily focuses on the order in which memory operations are perceived by different threads, not on the local ordering of accesses to specific words within a cache block.
2. What is the purpose of Write Propagation in memory consistency and cache coherence?
a. It delays write operations to ensure coherence.
b. It ensures that the value written in one cache is propagated to at least one sharer in a predetermined order.
c. It guarantees that a write is eventually seen by all threads.
d. It blocks propagating the write values to other threads to ensure coherence.
Write propagation ensures that when a write operation is performed, the modified data is eventually made visible to all threads in a multi-threaded system. This is essential for maintaining memory consistency and ensuring that all threads have a coherent and up-to-date view of the data. Write propagation helps prevent data inconsistencies and synchronization issues in multi-threaded programs.

3. Which one of the following protocols ensures that a cache controller sends broadcast messages in a common medium for other cache controllers connected to it for taking appropriate cache coherence operations?
a. Write serialization protocols
b. Snoopy protocols
c. Directory based protocols
d. Consistency protocols
Snoopy protocols are a class of cache coherence protocols in which each cache controller monitors or "snoops" the common communication medium (usually a shared bus) for transactions initiated by other cache controllers. When a cache controller observes a transaction that may affect its own cache's data, it takes appropriate cache coherence actions, such as invalidating or updating its cache line to maintain data consistency. These protocols rely on broadcast messages to keep caches coherent and are efficient for smaller-scale systems.
4. Consider a MESI cache coherence protocol. The cache controller of processor P2 snooping on the bus receives a broadcast message that processor P1 encountered a read miss on a cache block B that is in M state in P2. Apart from forwarding the copy of B to P1, what will be the state transition done by P2 on the cache block B?
a. Retain the state of B as M.
b. Change the state of B to S.
c. Change the state of B to I.
d. Change the state of B to E.
The MESI protocol transitions from M (Modified) to S (Shared) when another processor requests the same data (encounters a read miss), and the processor that had it in the Modified state has to relinquish its exclusive ownership. Changing the state to Shared means that multiple processors can have the data in their caches in a non-exclusive manner.
5. Which of the following are the advantages of using a directory based cache coherence protocol over a snooping based cache coherence protocol? [Multiple correct answers]
A. Reduced cache size requirements for a given workload
B. Less contention in accessing the directory
C. Elimination of broadcast messages
D. Scalability of processors attached to the interconnect
Less contention in accessing the directory: directory-based protocols have lower contention for shared resources, improving system performance.
Scalability of processors attached to the interconnect: they are better suited for large multiprocessor systems where snooping-based protocols may become inefficient due to increased complexity and bus congestion.
Reduced cache size requirements is not a typical advantage of directory-based protocols.
Elimination of broadcast messages is not entirely true, as some messaging or directory access is still needed in directory-based protocols.
6. If two co-operating processors P1 and P2 write to two different words W1 and W2, respectively, of a cache block B. The system uses directory cache coherence protocol. Which of the following statements is/are TRUE? [Multiple correct answers]
a. If memory accesses of P1 to W1 and P2 to W2 are strictly interleaved, then there will be no coherence misses at all.
b. There exists a false sharing of block B between P1 and P2.
c. There exists a true sharing of block B between P1 and P2.
d. If memory accesses of P1 to W1 and P2 to W2 are strictly interleaved, then the cache block B keeps bouncing between P1 and P2.
There exists false sharing because P1 and P2 are accessing different words within the same cache block, which can lead to unnecessary coherence overhead. If memory accesses of P1 to W1 and P2 to W2 are strictly interleaved, the cache block B can keep bouncing between P1 and P2 due to continuous state transitions caused by interleaved accesses. This is known as cache line bouncing or thrashing and can result in inefficient use of system resources. Hence (b) and (d) are TRUE.
7. Consider a directory based coherence system for a 256 GB physical address space. A 16-core processor is connected to this physical address space. Each core has a 128 KB, 4-way set-associative private cache memory of block size 256 bytes. Assume that the central directory will store information of the most frequently used 1024 cache blocks only. Each directory entry will store the state (2 bits to represent one of the 3 states: E, U and S), block number and list of sharers (1-bit per core). What is the storage space consumed by the directory in bytes? [2 marks]
Ans: 6144
Solution: # blocks in this system = 2^38/2^8 = 2^30
Bits in each directory entry: 2 (state) + 30 (block number) + 16 (1-bit per core) = 48
Each directory entry is 48 bits → 6 bytes.
The directory stores information about the most frequently used 1024 cache blocks.
Hence directory storage is 1024 x 6 bytes = 6144 bytes.
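The directory-storage arithmetic can be sketched as:

```python
# 256 GB = 2^38 bytes of memory, 256 B = 2^8 bytes per block → 30-bit block number.
block_number_bits = 38 - 8
entry_bits = 2 + block_number_bits + 16   # state + block number + 16-bit sharer vector
entry_bytes = entry_bits // 8             # 48 bits → 6 bytes
print(1024 * entry_bytes)                 # → 6144
```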


8. Consider a multi-processing system with two cores A and B with their own private caches and a single shared main memory using MESI cache coherence protocol. The following 4 lines of code are running in each of the two cores.
1. LW R1, M1
2. LW R2, M2
3. SW R3, M2
4. SW R2, M1
The addresses pointed by M1 and M2 map to different cache blocks. Consider the following execution sequence in the format Core-Instruction Number: A-1, A-2, B-2, B-1, B-3, A-3, B-4, A-4. If the state of the 4 blocks (M1 in A, M2 in A, M1 in B, M2 in B) can be represented as PQRS where P/Q/R/S can be any one of M/E/S/I, the initial state is IIII. Which one of the following represents the state of these blocks after the execution of the above 8 instruction sequences? [2 marks]
a. MMII
b. SSSS
c. MSIS
d. ISMI
Sequence:
A-1: LW R1, M1
A-2: LW R2, M2
B-2: LW R2, M2
B-1: LW R1, M1
B-3: SW R3, M2
A-3: SW R3, M2
B-4: SW R2, M1
A-4: SW R2, M1

Core-Ins    M1 in A    M2 in A    M1 in B    M2 in B
initial     I          I          I          I
A-1         E          I          I          I
A-2         E          E          I          I
B-2         E          S          I          S
B-1         S          S          S          S
B-3         S          I          S          M
A-3         S          M          S          I
B-4         I          M          M          I
A-4         M          M          I          I

Answer: MMII
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 9: Detailed Solution
Indian Institute of Technology Guwahati

1. Which component of access time in Hard Disk dominates for very short seeks (2-4 cylinders)?
a. Settle time
b. Coast time
c. Speedup time
d. Slowdown time
During short seeks, where the heads are moving only a small distance, the time it takes for the heads to settle in the correct position can be a significant portion of the total access time. This is because the heads need to overcome mechanical inertia and vibrations to accurately position themselves over the desired data track.
2. In a DRAM system that follows open row buffer management policy, which of the following sequences of commands is generated if the new request is to a different row from the row that was accessed last?
a. Activate
b. Precharge followed by CAS
c. Activate followed by Precharge
d. Precharge followed by Activate
In a DRAM system with open row buffer management, when accessing a different memory row than the last one, the following sequence occurs:
Precharge: the current row is reset to prepare for the next access.
Activate: the desired memory row is selected and its data is copied into the row buffer.
CAS (Column Access Strobe): the specific data from the row buffer is accessed.
So, the sequence is Precharge → Activate → CAS.


3. Which of the following is NOT a function of the DRAM controller?
a. Translate memory requests to DRAM command sequences.
b. Manage power consumption and thermals in DRAM.
c. Buffer and schedule incoming memory requests.
d. Reorganizing the stored data in DRAM for better space utilization.
DRAM controllers primarily handle tasks such as translating memory requests to DRAM command sequences, managing power consumption and thermals in DRAM (to some extent), and buffering/scheduling incoming memory requests. Reorganizing data for better space utilization is typically a function of higher-level memory management or file systems and is not a direct responsibility of the DRAM controller.
4. What is the primary goal of disk scheduling algorithms?
a. Minimize bits required to store a file
b. Maximize rotational latency
c. Minimize seek time
d. Maximize bit-cell density
Disk scheduling algorithms aim to reduce the time it takes for the read/write heads of a hard disk drive to seek (move) from their current position to the desired track or cylinder where data needs to be read or written. Minimizing seek time helps improve the overall efficiency and performance of disk I/O operations.


5. What is the purpose of the refresh operation in DRAM?
e. To maintain high data density
f. To lower manufacturing costs
g. To prevent data loss over time
h. To increase data access speed
DRAM stores data as electrical charges in tiny capacitors within each memory cell. These charges tend to leak away over time due to the inherent electrical properties of the capacitors. The refresh operation is designed to periodically read and then immediately rewrite the data stored in each memory cell to ensure that the charge levels are maintained. By doing so, it prevents the loss of data that could occur if the charge levels were allowed to decay.

6. A 64 GB DRAM system that uses 4 channels (C0, C1, C2 and C3) has 2048 columns per row. It uses a 64-bit wide memory bus to transfer data from DRAM to the processor. If adjacent memory words are mapped on to adjacent memory channels, which channel will fetch the physical address 0x2953A1B5C?
a. C0
b. C1
c. C2
d. C3
64 GB DRAM → 36-bit physical address.
If adjacent memory words are mapped on to adjacent memory channels, then the channel bits sit just above the last 3 bits (byte within the 64-bit bus).
The address split up is as follows:
Rest (rank + row) | column (11) | channel (2) | byte within bus (3)
Check the channel bits of the address 0x2953A1B5C:
0x2953A1B5C in binary = 0010 1001 0101 0011 1010 0001 1011 0101 1100
The channel bits (bits 4-3) are 11 = 3, so the answer is C3.
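The channel extraction can be sketched with two bit operations:

```python
# Channel bits are bits 4-3, just above the 3 byte-within-bus bits of the
# 64-bit (8-byte) data bus; 4 channels need 2 bits.
addr = 0x2953A1B5C
channel = (addr >> 3) & 0b11
print(channel)    # → 3, i.e. channel C3
```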
7. A 4 GB hard disk that has only 1 magnetic surface for storing data has 256 cylinders and there are 128 sectors per track. If all sectors/cylinders are storing the same amount of data, the maximum size of a file that occupies 8 sectors of a cylinder in KB is ______.
Ans: 1024
Size of hard disk = 4 GB = 2^32 bytes
Number of cylinders = 256 = 2^8
Number of sectors per track = 128 = 2^7
Maximum size of a file that occupies 8 sectors of a cylinder = 2^32 / (2^8 * 2^7 / 2^3) = 2^32 / (2^8 * 2^4) = 2^20 bytes = 1024 KB
8. A disk drive has 200 cylinders numbered from 0 to 199. The disk arm is initially positioned at cylinder 50. There are now five pending disk requests (cylinder numbers) in the queue: 72, 55, 40, 90, 5. Calculate the total head movements to service all these requests using the SSTF disk scheduling algorithm.

Answer: 155
SSTF always services the pending request closest to the current head position, so the service order is 50, 55, 40, 72, 90, 5.
Move from cylinder 50 to cylinder 55: 55 - 50 = 5 movements
Move from cylinder 55 to cylinder 40: 55 - 40 = 15 movements
Move from cylinder 40 to cylinder 72: 72 - 40 = 32 movements (from 40, cylinder 72 is closer than cylinder 5)
Move from cylinder 72 to cylinder 90: 90 - 72 = 18 movements
Move from cylinder 90 to cylinder 5: 90 - 5 = 85 movements
Total = 5 + 15 + 32 + 18 + 85 = 155
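The SSTF policy is easy to express in code. This sketch (names are mine) greedily services the closest pending cylinder and accumulates the head movement:

```python
def sstf_total_movement(start, requests):
    """Total head movement under shortest-seek-time-first scheduling."""
    pending = list(requests)
    head, total = start, 0
    while pending:
        # pick the pending cylinder closest to the current head position
        nearest = min(pending, key=lambda c: abs(c - head))
        total += abs(nearest - head)
        head = nearest
        pending.remove(nearest)
    return total

print(sstf_total_movement(50, [72, 55, 40, 90, 5]))   # prints 155
```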
9. Consider a 1MB DRAM on a single DIMM with two ranks, 16 banks (named B0, B1, B2, ..., B15) per rank and 32 columns per row. The data bus width is 16 bytes. The addressing uses row interleaving. Which of the following physical addresses is mapped to bank number B5? [2 marks]

a. 0x72AC6
b. 0x55587
c. 0x65B24
d. 0x578B5

1MB DRAM implies a 20-bit physical address. In row interleaving the bank bits are between the column and row bits. The address split-up is as follows:

Rest (rank + row): 7 bits | Bank: 4 bits | Column: 5 bits | Byte within bus: 4 bits

Check the bank bits (bits 9 to 12) in each of the addresses:
0x72AC6 in binary = 0111 0010 1010 1100 0110 → bank bits 0101 = 5
0x55587 in binary = 0101 0101 0101 1000 0111 → bank bits 1010 = 10
0x65B24 in binary = 0110 0101 1011 0010 0100 → bank bits 1101 = 13
0x578B5 in binary = 0101 0111 1000 1011 0101 → bank bits 1100 = 12
Only 0x72AC6 maps to bank B5, so the answer is (a).
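Extracting the bank field can be automated. This snippet assumes the [rest | bank | column | byte-within-bus] layout derived above; the helper name and defaults are mine:

```python
def dram_bank(addr, byte_bits=4, col_bits=5, bank_bits=4):
    """Bank number under row interleaving: the bank field sits just
    above the column and byte-within-bus fields."""
    return (addr >> (byte_bits + col_bits)) & ((1 << bank_bits) - 1)

for a in (0x72AC6, 0x55587, 0x65B24, 0x578B5):
    print(hex(a), dram_bank(a))    # only 0x72AC6 maps to bank 5
```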
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 10: Detailed Solution
Indian Institute of Technology Guwahati

1. What is the basic unit of flow control between a pair of adjacent routers in an NoC?
a. Crossbar
b. Packet
c. Buffer
d. Flit

The basic unit of flow control between a pair of adjacent routers in a Network-on-Chip (NoC) is typically a "flit". Flits are smaller than traditional packets and are used to control the flow of data through the NoC by dividing data into smaller, manageable units. They help in routing data efficiently and managing congestion in on-chip networks.

2. What does the term topology specify in on-chip networks?
a. The way routers are connected
b. The routing algorithm used
c. The flow control mechanism
d. The size of the network

In the context of on-chip networks, "topology" specifies the way routers (communication points) are connected to each other on a microchip. It is like the blueprint for how data travels within the chip. Different topologies have different benefits and drawbacks.

3. When we use source routing in on-chip networks, where is the routing information stored?
a. In the packet header
b. In the router's table
c. In the virtual channel
d. In the crossbar

In on-chip networks that use source routing, the routing information is typically stored in the packet header. Each packet contains the necessary information to specify the route it should take through the network. This information is placed in the header of the packet, and routers use it to determine the path the packet should follow to reach its destination.
4. In an NoC, the following are the functions performed by a router in order to forward a packet received in its input port to an appropriate output port:
i. Route Computation
ii. Buffering of Flits
iii. Switch Allocation
iv. VC Allocation
v. Switch Traversal
vi. Link Traversal

Which of the following is the correct sequence in which the functions are performed?
a. i-ii-iii-iv-v-vi
b. ii-i-iv-iii-v-vi
c. ii-i-iii-iv-v-vi
d. v-ii-i-iv-iii-vi

The correct sequence of functions performed by a router in a Network-on-Chip (NoC) to forward a packet received in its input port to an appropriate output port is:
ii - Buffering of Flits (incoming flits are buffered to wait for further processing)
i - Route Computation (the router determines the route the packet should take)
iv - VC Allocation (virtual channels are allocated for the packet)
iii - Switch Allocation (the router arbitrates for crossbar passage to the chosen output port)
v - Switch Traversal (the flit is sent through the switch or crossbar to the designated output port)
vi - Link Traversal (the flit travels across the link to the next router or destination)

So, the correct sequence is: ii - i - iv - iii - v - vi (option b).



5. In an NoC router, if an incoming packet has more than one potential output port possible as per the adaptive routing algorithm, one output port is finally chosen by ____
a. input selection strategy
b. spatial scheduling
c. output selection strategy
d. VC allocation

One output port is finally chosen by the "output selection strategy" if an incoming packet in an NoC router has more than one potential output port possible according to the adaptive routing algorithm. The output selection strategy determines which specific output port the packet should be forwarded to, based on factors such as congestion, availability, and other routing criteria.

6. Which one of the following statements is FALSE?
a. XY routing is minimal and always deadlock free.
b. Odd even routing is adaptive but not deadlock free.
c. East first routing can be non-minimal.
d. North last routing is deadlock free.

Statement (b) is false. Odd-even routing is adaptive, which means it can dynamically choose different paths based on network conditions, but it is also deadlock free: by restricting the turns packets may take in odd and even columns, it avoids cyclic channel dependences. So the claim that it is not deadlock free is not accurate.

7. In a 6x6, 2D-mesh network on chip, the number of routers in the network that are directly connected to four other routers as well as to a local tile is ____.

Ans: 16
In a 6x6 2D-mesh network on chip, every router is connected to its local tile, but only the non-boundary routers are directly connected to four neighboring routers (north, south, east, and west) within the mesh. Removing the boundary rows and columns leaves the inner 4x4 grid of routers:

1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16

Hence, 4 x 4 = 16 routers.

8. Consider a 64-tile system that uses a square mesh NoC topology where routers follow minimal odd-even routing. The packet P1 travels from router 18 to 36. How many unique paths exist for this packet to reach its destination?

Ans: 3
The 64 tiles form an 8x8 mesh, numbered with router 0 at the bottom left:

56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7

The minimal odd-even paths from 18 to 36 are:
1: 18 -> 19 -> 27 -> 35 -> 36
2: 18 -> 26 -> 27 -> 35 -> 36
3: 18 -> 26 -> 34 -> 35 -> 36
9. Consider an 8x8 mesh NoC that uses XY routing to forward packets. Consider 3 packets P1, P2 and P3 whose details (packet number, source, destination) are given: <P1, 18, 44>, <P2, 39, 12> and <P3, 23, 3>. What is the router number through which all the three packets will pass through?

Answer: 20

56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7

With XY routing, a packet first travels along the X direction (within its row) and then along the Y direction (within its column):
P1: 18 -> 19 -> 20 -> 28 -> 36 -> 44
P2: 39 -> 38 -> 37 -> 36 -> 28 -> 20 -> 12
P3: 23 -> 22 -> 21 -> 20 -> 19 -> 11 -> 3
The common router in all 3 paths is 20.
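The XY paths above can also be generated programmatically. This sketch (the function name is mine) numbers routers row-major with 0 at the bottom left, matching the grid:

```python
def xy_route(src, dst, width=8):
    """Path taken by XY routing: move along the row (X) first, then the column (Y)."""
    path = [src]
    r, c = divmod(src, width)
    dr, dc = divmod(dst, width)
    while c != dc:                      # X traversal
        c += 1 if dc > c else -1
        path.append(r * width + c)
    while r != dr:                      # Y traversal
        r += 1 if dr > r else -1
        path.append(r * width + c)
    return path

paths = [xy_route(18, 44), xy_route(39, 12), xy_route(23, 3)]
common = set(paths[0]).intersection(*map(set, paths[1:]))
print(common)   # {20}
```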


NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 11: Detailed Solution
Indian Institute of Technology Guwahati

1. In side-buffered deflection routers, what is the role of side buffers?
a. To eliminate link contention
b. To store deflected flits temporarily
c. To store flits that are golden
d. To reduce port conflict

In side-buffered deflection routers, the role of side buffers is to "store deflected flits temporarily." When a router experiences congestion and cannot immediately forward a flit to its intended output port, it may deflect the flit into a side buffer. The side buffer temporarily holds the deflected flits until they can be transmitted when the output port becomes available. This helps in reducing congestion and unnecessary deflections in the network by allowing flits to be stored temporarily.
no
2. What is hot-potato routing?
a. Routing based on packet age
b. Routing based on deflection frequency
c. Routing based on any available output port
d. Routing based on network congestion

Hot-potato routing is "routing based on any available output port" rather than waiting for an optimal or congestion-free route. It is used to reduce latency and prevent packets from being held up in the network due to congestion.


3. At any given point in time, the maximum number of silver flits in a 5x5 mesh NoC realized using MinBD routers is ____.
a. 1
b. 5
c. 25
d. 10

In MinBD routers there can be only 1 silver flit at each router at any point of time. Therefore, the maximum number of silver flits in a 5x5 mesh NoC realized using MinBD routers is 5 x 5 = 25 (option c).

4. Which of the following is TRUE with respect to buffer-less deflection router CHIPPER?
a. One packet per router is identified as golden and golden packets are never deflected.
b. Ejection stage is kept after the inject stage in the router pipeline.
c. Port allocation is done using a parallel PDN logic.
d. Priority of flits are identified after a total sorting of flits based on age.

CHIPPER is a buffer-less deflection router that uses a parallel permutation deflection network (PDN) for port allocation. The PDN determines the output port for each incoming flit, and it does so in parallel, allowing flits to be routed without buffers, which helps in reducing latency and complexity in the router.

5. Which of the following is TRUE with respect to deflection router MinBD?
a. All the flits assigned with non-productive ports by PDN are forwarded to side buffer.
b. At most two flits can be ejected per router per cycle.
c. In the router pipeline, the buffer inject unit is kept after the buffer eject unit.
d. Once a flit becomes silver, it is no longer deflected till it reaches its destination.

MinBD (minimally-buffered deflection) routers augment a bufferless deflection router with a small side buffer to reduce the number of deflections. To use ejection bandwidth efficiently, they support dual ejection, allowing at most two flits to be ejected per router per cycle.

6. Which of the following is FALSE?
a. CHIPPER router uses golden packet scheme for flit prioritization
b. BLESS uses sequential port allocation logic
c. MinBD uses quadrant routing algorithm
d. DeBAR has single ejection port.

Statement (c) is false. MinBD is a minimally-buffered deflection router that combines deflection routing with a small side buffer; it does not use the quadrant routing algorithm. Quadrant-based routing is employed by DeBAR.

7. Which one of the following uses the quadrant routing technique?
a. SLIDER
b. CHIPPER
c. DeBAR
d. MinBD

DeBAR is a specific router architecture designed for on-chip networks and uses a quadrant-based routing algorithm to make routing decisions.



8. Which one of the following uses restrictive and non-restrictive injection?
a. SLIDER
b. CHIPPER
c. DeBAR
d. MinBD

SLIDER uses both restrictive and non-restrictive injection techniques. Restrictive injection only allows flits to enter the network when a clear path is available, while non-restrictive injection permits flits to enter even if they may need to be deflected. This helps SLIDER manage network congestion and improve routing efficiency in on-chip networks.

9. Consider a 4x4 NoC with CHIPPER routers. The preferred port is chosen by XY routing. Consider 4 packets that reach router 6 (routers are numbered from 0 to 15). The details of the packets (Packet number, Golden, Input Port, Destination) are <P1, Yes, S, 10>, <P2, No, W, 14>, <P3, No, E, 9> and <P4, No, N, 6>. Ties between two non-golden flits are resolved using the packet number: the higher the packet number, the higher the priority. How many packets get a productive output port in the PDN?

Answer: 2
P4 is destined to router 6 itself, so it is ejected and does not compete for an output port. Routers 10 and 14 lie in the same column as router 6, so P1 and P2 both prefer the same column output port, while P3 (destination 9, which XY routing reaches by first moving one hop toward column 1) prefers the W port. P1 is golden and wins the contended column port, P3 gets the uncontested W port, and P2 is deflected. Hence 2 packets (P1 and P3) get their productive output ports in the PDN.

NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 12: Detailed Solution
Indian Institute of Technology Guwahati

1. The number of cycles a packet can be delayed in the network without reducing the application's performance is known as _____.
a. Packet latency
b. Network stall time
c. Slack
d. Critical time

The slack of a packet is defined as the number of cycles it can be delayed in the network without reducing the application's performance.

2. Intel KNL has _____ tiles in 2D mesh.
a. 8
b. 36
c. 64
d. 16

Intel KNL has 36 tiles interconnected by a 2D mesh.

3. Which one of the following emerging NoCs uses the concept of diverting light of a certain wavelength when voltage is applied?
a. 3D NoC
b. Nanophotonics
c. RF waveguide wireless communication
d. Vertical interconnect

Nanophotonics is a promising technology for network building blocks, due to its inherently low latency, high throughput, and low dynamic energy requirements. Nanophotonics uses a microring resonator, which diverts light of a certain wavelength when a voltage is applied.
4. Which one of the following is a 64-bit dual core VEGA processor?
a. VEGA AS1061
b. VEGA AS2161
c. VEGA AS4161
d. VEGA AS1161

VEGA AS2161 is a 64-bit dual core 16-stage pipeline out-of-order RISC-V processor. Refer to https://vegaprocessors.in/vega.php for details.

5. Which one of the following statements is TRUE about layerwise DNN computation done on a TCMP system?
a. Filter/weights are moved from global buffer to off-chip memory
b. Output feature map is progressed from off-chip memory to global buffer
c. Global buffer is directly connected to each PE using a dedicated bus
d. Filter/weights are moved from off-chip memory to global buffer

Layerwise DNN computation on a TCMP system involves the movement of filter/weights from off-chip memory to the global buffer (option d).



6. Consider a 16x16 Cmesh NoC structure in which each node is connected to 4 processing cores. The entire mesh structure is divided into 16 Wcubes. Each Wcube is marked with a 4-bit number [Wcube-0 (0000) to Wcube-15 (1111)]. Identify the correct Wcube with which Wcube-15 can directly communicate.
a. Wcube-9
b. Wcube-13
c. Wcube-10
d. Wcube-8

Direct communication is possible only with a Wcube that is at Hamming distance 1 (the two Wcube numbers differ at exactly one bit position). Among the options, only 15 (1111) and 13 (1101) differ in exactly one bit position, so the answer is (b).
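A quick Hamming-distance check confirms this; the helper below is illustrative:

```python
def hamming_distance(a, b):
    """Number of bit positions at which a and b differ."""
    return bin(a ^ b).count("1")

# Wcube-15 (1111) can communicate directly only at Hamming distance 1
for w in (9, 13, 10, 8):
    print(w, hamming_distance(15, w))   # only Wcube-13 gives distance 1
```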

7. Consider a TCMP system with 64 tiles, where each tile consists of a superscalar processor, a private L1 cache and a shared distributed L2 cache. The total L2 cache on the chip is 16MB and L2 uses 64B blocks and is 8-way associative. Each L2 cache slice on-chip has all the 8 ways of the sets assigned to it. The L2 cache memory per tile division is such that total sets in L2 cache are uniformly partitioned across tiles in a sequential fashion. The system uses a 32-bit physical address. How many L2 cache sets are mapped per tile?

Correct answer: 512
L2 size per tile = 16MB / 64 = (2^4 x 2^20) / 2^6 = 2^18 bytes
Number of sets in one L2 slice = (size of slice) / ((size of block) x (associativity))
= 2^18 / (2^6 x 2^3) = 2^9 = 512
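The per-tile set count can be verified with straightforward arithmetic (variable names are illustrative):

```python
total_l2 = 16 * 2**20      # 16 MB shared L2
tiles = 64
block_size = 64            # bytes per block
ways = 8                   # associativity

slice_size = total_l2 // tiles                      # L2 bytes per tile
sets_per_tile = slice_size // (block_size * ways)   # sets hosted per tile
print(sets_per_tile)   # 512
```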



8. Consider a TCMP system with a 4x4 mesh NoC where each tile consists of a superscalar processor, a private L1 cache and a shared distributed L2 cache. Let T0, T1, T2, ..., T15 correspond to the tiles where T0 is the bottom left tile and T15 the top right tile. Each tile has a 16KB 2-way associative L1 cache with a block size of 16B. The total L2 cache on the chip is 32MB and L2 uses 128B blocks and is 16-way associative. Each L2 cache bank has all the 16 ways of the sets assigned to it. The L2 cache memory per tile division is such that total sets in L2 cache are uniformly partitioned across all tiles in sequential fashion. The system uses a 40-bit physical address. T4 generated an L1 cache miss for the address A1 = 0xA8CD210652. As per L2 set mapping, tile Tx hosts the L2 set for A1. What is the value of x? (Hint: Possible value of x ranges from 0 to 15.)

Correct answer: 0
L2 size per tile = 32MB / 16 = 2MB = 2^21 bytes
Number of sets in one L2 bank = 2^21 / (2^7 x 2^4) = 2^10
So, the 40-bit address splits as: tag = 19 bits, tile = 4 bits, set index within tile = 10 bits, byte offset = 7 bits.
0xA8CD210652 = 1010 1000 1100 1101 0010 0001 0000 0110 0101 0010
The rightmost 7 bits are the byte offset, the next 10 bits are the set index, and the next 4 bits are the tile number; the remaining bits are the tag. Extracting the tile bits gives 0000 = 0.
So the miss request travels from T4 to T0 through the NoC.
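The tile extraction can be expressed as a small bit-field helper; the field widths follow the [tag | tile | set index | byte offset] split derived above, and the function name is mine:

```python
def l2_home_tile(addr, offset_bits=7, set_bits=10, tile_bits=4):
    """Home tile of a physical address under the
    [tag | tile | set index | byte offset] layout (MSB to LSB)."""
    return (addr >> (offset_bits + set_bits)) & ((1 << tile_bits) - 1)

print(l2_home_tile(0xA8CD210652))   # prints 0 -> the set lives on tile T0
```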