COA (Week 1 To Week 12 Detailed Solution)
1. In a typical execution cycle, which one of the following sequences accurately depicts the
steps involved?
a. Instruction fetch → Operand fetch → Instruction decode → Execute → Result store
b. Instruction decode → Instruction fetch → Operand fetch → Execute → Result store
c. Instruction fetch → Instruction decode → Operand fetch → Execute → Result store
d. Operand fetch → Instruction fetch → Instruction decode → Execute → Result store
In a processor's typical execution cycle, instructions follow a fetch-decode-execute pattern.
Initially, the processor fetches the next instruction from memory using the program counter.
Afterward, the fetched instruction is decoded to identify its operation and operands.
Following this, the necessary operands are fetched from the register file or memory. With
the operands in hand, the instruction is executed, carrying out the intended operation.
Finally, the resulting output is stored in the destination register or memory. This continuous
cycle enables efficient execution of instructions by the processor.
2. Which one of the following is the feature of the Little Endian scheme?
In the Little Endian scheme, the lower-order bytes of multi-byte data (e.g., integers, floating-point numbers) are stored at lower memory addresses, and the higher-order bytes are stored at higher memory addresses. This means that the least significant byte is stored first in memory, at the smallest address, followed by the more significant bytes in increasing order of significance.
3. In terms of processor memory interaction, what is the role of the Program Counter?
a. It contains the address of the instruction currently being executed
The Program Counter is a register that keeps track of the memory address of the next
instruction that the processor needs to fetch and execute. During the fetch phase of the
processor's instruction cycle, the PC is used to access the memory location containing the
next instruction. Once the instruction is fetched, the PC is updated to point to the address of
the subsequent instruction in memory, enabling the processor to continue executing
instructions sequentially.
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 1: Detailed Solution
Indian Institute of Technology Guwahati
4. CISC architecture attempts to minimize the number of instructions per program but at the
cost of:
a. Using a larger memory
b. Using a wider bus to carry an instruction from memory
c. Decrease in the average number of cycles per instruction
d. Increase in the average number of cycles per instruction
In CISC (Complex Instruction Set Computer) architecture, complex and multi-step
instructions are used to reduce the number of instructions needed to perform a particular
task. However, these complex instructions often require multiple cycles to execute, leading
to an increase in the average number of cycles per instruction. This can result in longer
execution times for individual instructions compared to simpler architectures like RISC
(Reduced Instruction Set Computer).
5. A processor has 8 general-purpose registers. It uses a 24-bit instruction format. If the
opcode field occupies 6 bits followed by the register field that stores a register address, how
many bits are left for other fields in the instruction?
a. 14 bits
b. 8 bits
c. 10 bits
d. 15 bits
Solution:
Total bits used for opcode field + register field = 6 + 3(for 8 registers) = 9 bits
Since the instruction format is 24 bits in total, there are 24 - 9 = 15 bits left for other fields.
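The field arithmetic can be checked with a short script (a sketch; the register count and field widths are taken from the question):

```python
# Bit budget for the 24-bit instruction format described in the question.
REG_COUNT = 8      # 8 general-purpose registers
INSTR_BITS = 24    # total instruction width
OPCODE_BITS = 6    # opcode field width

reg_field_bits = (REG_COUNT - 1).bit_length()   # 3 bits address 8 registers
remaining_bits = INSTR_BITS - OPCODE_BITS - reg_field_bits
print(reg_field_bits, remaining_bits)   # 3 15
```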
6. Consider the following program for a stack-organized processor:
Push D
Push C
Push B
Mult
Add
Pop A
If the values in memory locations B, C, and D are 6, 2, and 4, respectively, what will be stored in memory location A after the execution of the above program?
a. 14
b. 16
c. 26
d. 12
Push D: Pushes the value of D (4) onto the stack.
Stack: [4]
Push C: Pushes the value of C (2) onto the stack.
Stack: [4, 2]
Push B: Pushes the value of B (6) onto the stack.
Stack: [4, 2, 6]
Mult: Multiplies the top two values on the stack (2 * 6) and pushes the result (12) onto the
stack.
Stack: [4, 12]
Add: Adds the top two values on the stack (4 + 12) and pushes the result (16) onto the
stack.
Stack: [16]
Pop A: Pops the top value (16) from the stack and stores it in memory location A.
After the execution of the program, the value 16 will be stored in memory location A.
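The trace above can be reproduced with a minimal stack-machine interpreter (a sketch; the opcode names mirror the program listing):

```python
# Minimal interpreter for the Push/Mult/Add/Pop stack program above.
def run(program, memory):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(memory[arg[0]])      # push value of a memory location
        elif op == "MULT":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)               # replace top two with product
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)               # replace top two with sum
        elif op == "POP":
            memory[arg[0]] = stack.pop()      # store top into memory
    return memory

mem = {"B": 6, "C": 2, "D": 4}
prog = [("PUSH", "D"), ("PUSH", "C"), ("PUSH", "B"),
        ("MULT",), ("ADD",), ("POP", "A")]
result = run(prog, mem)
print(result["A"])   # 16
```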
7. A program has 30% Load instructions, 20% Store instructions, 40% ALU instructions, and 10% Branch instructions. On a processor, each Load, Store, ALU, and Branch instruction takes 4, 3, 1, and 2 cycles, respectively. What is the average CPI (Cycles Per Instruction) on this processor?
Average CPI = (0.3 × 4) + (0.2 × 3) + (0.4 × 1) + (0.1 × 2) = 1.2 + 0.6 + 0.4 + 0.2 = 2.4
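The same weighted average in code form (a sketch using the mix from the question):

```python
# Average CPI as a frequency-weighted sum over the instruction mix.
mix    = {"load": 0.30, "store": 0.20, "alu": 0.40, "branch": 0.10}
cycles = {"load": 4,    "store": 3,    "alu": 1,    "branch": 2}

avg_cpi = sum(mix[k] * cycles[k] for k in mix)
print(round(avg_cpi, 2))   # 2.4
```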
8. A new Graphics Processing Unit (GPU) is added to a system, which speeds up the
execution of graphics-related instructions by 6 times. If a program has 50% graphics-related
instructions, what is the overall speedup gained while running the program on the system
with the GPU compared to running it on the system without the GPU? [ 2 marks]
Given:
Fraction enhanced = 50% = 0.5
Speedup = 6
Overall speedup = 1 / [(1 - fraction enhanced) + (fraction enhanced / speedup)]
Overall speedup = 1 / [(1 - 0.5) + (0.5 / 6)]
= 1 / [0.5 + 0.0833]
= 1 / 0.5833
≈ 1.71
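Amdahl's law, as applied above, can be expressed directly in code (a sketch):

```python
# Amdahl's law: overall speedup when a fraction of execution is enhanced.
def overall_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# 50% graphics-related instructions sped up 6x, as in the question.
print(round(overall_speedup(0.5, 6), 3))   # 1.714
```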
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 2: Detailed Solution
Indian Institute of Technology Guwahati
1. Which type of data hazard occurs when an instruction tries to read an operand before
another instruction writes it?
a. RAW hazard
b. WAR hazard
c. WAW hazard
d. RAR hazard
The type of data hazard that occurs when an instruction tries to read an operand before
another instruction writes it is called a RAW hazard, which stands for "Read After Write"
hazard. This situation arises when a dependent instruction (one that requires the result of a
previous instruction) attempts to read data that has not yet been written by the preceding
instruction.
2. Which one of the following is true regarding pipelining in microprocessors?
a. Pipelining reduces the latency of a single instruction.
b. Pipelining improves the throughput of a program.
Pipelining improves the throughput of a program by overlapping the execution of multiple instructions, so more instructions complete per unit time. It does not reduce the latency of a single instruction (stage latches can even add slightly to it), nor does it directly increase the clock speed of the processor or reduce the execution time of an individual instruction.
3. Which type of hazard occurs when different instructions, at different stages in the pipeline, need the same hardware resource?
b. Control hazard
c. Structural hazard
d. Pipeline hazard
Structural hazards arise when there are not enough hardware resources (such as functional
units or memory ports) to accommodate the simultaneous execution of multiple
instructions in the pipeline. This can lead to delays and inefficiencies in the pipeline's
operation.
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 2: Detailed Solution
Indian Institute of Technology Guwahati
4. Which one of the following is a characteristic feature of a typical RISC machine?
a. Complex instructions with longer execution time.
b. Memory-intensive instructions.
c. Load and store instructions.
d. Instructions with variable lengths.
RISC architectures are designed to have a small and optimized set of instructions, with a
focus on simple and efficient operations. In RISC architectures, instructions like
arithmetic operations and logic operations are often performed directly on registers, and
memory operations are typically done using separate load and store instructions. This
approach aims to simplify the instruction pipeline and improve overall performance by
reducing the complexity of individual instructions.
5. Which one of the following statements is true with respect to an Instruction Fetch operation of a processor pipeline?
a. Contents from Instruction memory are transferred to ID/EX pipeline register
d. Contents from ID/EX pipeline register are transferred to IF/ID pipeline register
During the Instruction Fetch stage of a pipeline, the next instruction is fetched from
memory and placed into the IF/ID pipeline register for further processing in subsequent
pipeline stages.
6. In the given code sequence, the dependency between R2 of instruction I1 and R2 of instruction I3 is known as ________.
a. input dependence
b. anti dependence
c. output dependence
d. true data dependence
The dependency between R2 of instruction I1 and R2 of instruction I3 is an anti dependence (Write After Read). In an anti dependence, a later instruction (I3) writes a register that an earlier instruction (I1) reads. If I3's write were reordered to occur before I1's read, I1 would receive the wrong value, so the original read-before-write order between I1 and I3 must be preserved.
7. The technique of separating a dependent instruction from the source instruction by the pipeline latency of the source instruction is called ________.
a. instruction folding
b. compiler scheduling
c. operand forwarding
d. bypassing
Operand forwarding, also known as data forwarding or result forwarding, is a technique
used in pipelined processors to avoid data hazards. It allows the result of an instruction to
be forwarded directly to a dependent instruction, even before the result is written to the
register file or memory. This helps in reducing stalls and improving pipeline efficiency by
allowing instructions to proceed without waiting for the results of previous instructions.
8. Assume an instruction pipeline with 5 stages namely IF, ID, EX, MEM and WB with individual latencies 50 ns, 30 ns, 70 ns, 85 ns, and 40 ns, respectively. Pipeline latch delays and the stall statistics of the various instruction combinations are as specified. What is the speedup of the pipelined design over the non-pipelined design? Correct to 2 decimal places. [2 marks]
Answer: Range: 1.55 to 1.58
In the non-pipelined architecture, the CPI is 5.
Execution time for 1 instruction: 5 × (1/1.5) ns = 3.33 ns
In the pipelined architecture, the CPI is calculated as follows:
CPI_p = Base CPI + Stall CPI
= 1 + (0.3 × 0.1 × 30) + (0.2 × 0.3 × 2) + (0.1 × 1)
= 1 + 0.9 + 0.12 + 0.1 = 2.12
Execution time for 1 instruction: 2.12 × 1 ns = 2.12 ns
The speedup is calculated as: Speedup = ET_non-pipelined / ET_pipelined = 3.33 / 2.12 ≈ 1.57
Allowed Range: 1.55 to 1.58
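The arithmetic of the stated solution can be replayed as follows (a sketch; the stall fractions and penalties are copied from the solution text, since part of the question statement is missing here):

```python
# Pipelined CPI = base CPI + stall cycles per instruction (solution's figures).
base_cpi  = 1.0
stall_cpi = 0.3 * 0.1 * 30 + 0.2 * 0.3 * 2 + 0.1 * 1   # 0.9 + 0.12 + 0.1
cpi_pipelined = base_cpi + stall_cpi                    # 2.12

et_non_pipelined = 5 * (1 / 1.5)        # 5 cycles at a 1.5 GHz clock -> 3.33 ns
et_pipelined     = cpi_pipelined * 1.0  # 1 ns pipelined cycle -> 2.12 ns

speedup = et_non_pipelined / et_pipelined
print(round(speedup, 2))   # 1.57
```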
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 3: Detailed Solution
Indian Institute of Technology Guwahati
1. Which one of the following statements is/are TRUE?
I. A one-bit predictor changes the prediction value for each mis-prediction.
II. BPB stores the previous outcomes of the branch instruction.
III. (p,q) branch predictor uses the outcome of last p branches to index into the BPB where
each entry has a q-bit predictor.
IV. If BTB can store one or more target instructions it can facilitate branch folding.
a. III only
b. I and II only
c. IV only
d. I, II, III and IV
I. True - A one-bit predictor changes its prediction value for each mis-prediction by flipping
a single bit.
II. True- The BPB (Branch Prediction Buffer) stores previous branch prediction outcomes.
III. True - In a (p,q) branch predictor, the outcomes of the last p branches are used to index
into the BPB, which contains q-bit predictors for making predictions.
IV. True - The BTB (Branch Target Buffer) stores predicted branch target addresses; if it can also store one or more target instructions, those instructions can be supplied directly, facilitating branch folding.
2. With respect to a MIPS multi-cycle floating point pipeline, which one of the following
statements is FALSE?
a. RAW dependency stalls can happen even after enabling operand forwarding.
b. Even after operand forwarding, there will be 3 stalls between a pair of adjacent FADD instructions that has a RAW dependency between them.
d. Even after operand forwarding, there will be 7 stalls between a pair of adjacent FMUL instructions that has a RAW dependency between them.
Operand forwarding reduces RAW (Read-After-Write) dependency stalls by directly passing the required data from the producing instruction to the dependent instruction. With a floating-point multiply latency of 6 cycles, a pair of adjacent FMUL instructions with a RAW dependency incurs 6 stalls even with forwarding, not 7. Therefore, statement d is FALSE.
3. For filling the delay slot for a branch, an instruction is chosen from the target location of the branch if ________.
a. the outcome of the branch is irrelevant
b. the probability of branch not taken is very high
c. the probability of branch taken and not taken is same
d. the probability of branch taken is very high
Choosing an instruction from the target location of the branch is most effective when the
branch is highly likely to be taken. This practice optimizes the pipeline performance by
allowing the execution of an instruction from the taken branch's target, improving overall
instruction throughput.
4. Branch Prediction Buffer with 64 rows is indexed by
a. outcome of last 16 branches
b. outcome of last 8 branches
c. lower order 6 bits of the address of the branch instruction
d. 64 bits of the physical address of the branch instruction.
The lower-order bits of the branch instruction's address are used as an index to select the appropriate row in the Branch Prediction Buffer. With 64 rows, the lower-order 6 bits of the address locate the prediction information for that branch.
5. Which one of the following is TRUE for a (p, q) type branch predictor?
a. It uses the outcome of last p branches to index into the BPB where each entry has a
q-bit predictor.
b. It uses the outcome of last 2p branches to index into the BPB where each entry has a
q-bit predictor.
c. It uses the outcome of last p branches and last q bits of PC to index into the BPB to
A (p, q) type branch predictor uses the history of the outcomes of the last p branches to
index into a Branch Prediction Buffer (BPB). Each entry in the BPB is associated with a
q-bit predictor, which helps determine the predicted outcome of the current branch
instruction.
6. What is the latency of the floating-point Multiplier Unit in a MIPS processor?
a. 7
b. 6
c. 4
d. 1
The floating-point Multiplier Unit in a MIPS processor generally has a latency of 6 cycles
to complete its operation. This means that it takes 6 clock cycles for the multiplier unit to
produce the final result after receiving the input operands.
7. Which one of the following branch handling approach allows a branch to take place after
one instruction following the branch instruction?
a. Stall until branch direction is clear
b. Predict Branch Taken
c. Delayed branch
In the delayed branch approach, the instruction in the delay slot (the instruction immediately following the branch) is executed before the branch itself is taken. This allows the processor to use the time while the branch decision is being resolved, and it minimizes the impact of branch delays by keeping the pipeline busy.
8. Among the listed operations, which one is not having a fully pipelined implementation in a
MIPS processor?
Floating Point Divide is a complex operation that involves multiple stages and
dependencies, making it difficult to achieve full pipelining due to the variable latency and
inter-stage dependencies. As a result, Floating Point Divide operations tend to have longer
latencies and are not as amenable to pipelining as the other operations listed.
9. Consider a (2,2) type branch predictor. BHT is indexed by the outcome of the last 2
branches. The BPB is initialized for NN/NT/TN/TT as 00/00/11/11 and is indexed with an
NN entry in the first reference. Consider the last 6 actual outcomes of a single static branch,
{oldest N N T T T N latest} where T means branch is taken and N means not taken. What
will be the contents of BPB after the execution of the above mentioned 6 branch outcomes?
[2 marks]
(A) 01/01/11/00
(B) 01/01/00/11
(C) 01/01/11/10
(D) 01/01/10/11
S.No. | Last Outcome | BPB (NN/NT/TN/TT) | Prediction | Outcome | Misprediction?
1     | NN (Initial) | 00/00/11/11       | N          | N       | No
2     | NN           | 00/00/11/11       | N          | N       | No
3     | NN           | 00/00/11/11       | N          | T       | Yes
4     | NT           | 01/00/11/11       | N          | T       | Yes
5     | TT           | 01/01/11/11       | T          | T       | No
6     | TT           | 01/01/11/11       | T          | N       | Yes
Final BPB contents: 01/01/11/10
In each row, the entry indexed by the "Last Outcome" value is the one consulted; based on the prediction and the actual outcome, that entry's 2-bit counter makes a state transition, which is reflected in the next row. Therefore, at the end of the 6 branch outcomes, the BPB contents will be 01/01/11/10, i.e., option (C).
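The table can be verified with a small simulation (a sketch; a 4-entry BPB of 2-bit saturating counters, indexed by the outcomes of the last two branches, as in the question):

```python
# (2,2)-style predictor: 4-entry BPB of 2-bit saturating counters,
# indexed by the outcomes of the last two branches.
def simulate(outcomes, bpb, index="NN"):
    for actual in outcomes:
        # counters 2 and 3 predict Taken; 0 and 1 predict Not taken
        counter = bpb[index]
        bpb[index] = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
        index = index[1] + actual   # shift in the newest outcome
    return bpb

bpb = {"NN": 0b00, "NT": 0b00, "TN": 0b11, "TT": 0b11}
final = simulate("NNTTTN", bpb)
print("/".join(format(final[k], "02b") for k in ("NN", "NT", "TN", "TT")))
# 01/01/11/10
```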
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 4: Detailed Solution
Indian Institute of Technology Guwahati
1. Loop unrolling results in ___
a. increasing register pressure
b. decreasing I cache miss
c. increasing number of control hazards
d. increasing control dependence
Loop unrolling increases register pressure because more variables are processed simultaneously, potentially requiring more registers to store intermediate values. This can lead to register spills, where values are stored in memory due to a shortage of available registers.
2. In Tomasulo's algorithm, register renaming is done using ___.
a. Compiler scheduling
b. Reservation Station
c. Reorder Buffer
d. Common Data Bus
In Tomasulo's algorithm, reservation stations perform register renaming: when an instruction is issued, its operands are either read immediately or tagged with the reservation station that will produce them, and instructions wait in the reservation stations until they're ready to execute. This helps execute instructions out of order, boosts functional unit usage, and resolves data dependencies by holding the latest values regardless of instruction order. This method enhances performance and efficiency in dynamic instruction scheduling and out-of-order execution.
3. Which one of the following statements is TRUE?
c. In a MIPS multi-cycle floating point pipeline that supports operand forwarding, there will be six stalls between a pair of adjacent MUL instructions that have a RAW dependency between them.
Both WAW and WAR dependencies can be easily resolved through proper instruction scheduling
techniques, and they don't result in incorrect program behavior. The true data dependency that
impacts program execution is the RAW (Read After Write) dependency, where an instruction
depends on the result of a previous instruction.
4. Register Renaming can solve _____.
a. WAR and WAW hazards
b. RAW hazard only
c. WAR hazard only
d. RAW, WAR, and WAW hazards
Register renaming can solve WAR (Write After Read) and WAW (Write After Write) hazards. This
technique assigns temporary names to architectural registers, allowing instructions to use the same
register names without causing data hazards. This eliminates potential conflicts that could arise
when instructions read from or write to the same registers.
5. Which of the following is the best match from Set A to Set B?
Set A: W. Static Scheduling, X. Operand Forwarding, Y. Speculative Dynamic Scheduling, Z. CDB Writing
Set B: 1. Reorder Buffer, 2. Loop Unrolling, 3. RSi Update, 4. Reservation Station
W → 2 (Static Scheduling → Loop Unrolling): Static scheduling arranges the instruction order at compile-time, similar to how loop unrolling optimizes loops for better performance.
X → 4 (Operand Forwarding → Reservation Station): Operand forwarding allows instructions to use results directly, much like reservation stations hold operands for instructions in an out-of-order pipeline.
Y → 1 (Speculative Dynamic Scheduling → Reorder Buffer): Speculative dynamic scheduling involves executing instructions speculatively, aligning with the concept of a reorder buffer that commits speculatively executed instructions in program order.
Z → 3 (CDB Writing → RSi Update): When a result is broadcast on the Common Data Bus (CDB), the Register Status indicators (RSi) and the waiting reservation-station entries are updated with that value.
6. In a dynamically scheduled processor that supports speculation, if the register status
indicator of a register Rx is 0, then _____.
a. the latest value of Rx can be obtained from entry #0 in the reorder buffer.
b. the latest value of Rx will be produced by functional unit #0.
c. the latest value of Rx can be obtained from entry #0 in the reservation station.
d. the latest value of Rx is available in the Register File.
If the register status indicator of register Rx is 0 in a dynamically scheduled processor with speculation, it means that the latest value of Rx is available in the Register File and can be directly accessed without any further delays or dependencies.
7. Consider an ADD instruction with first operand as Rx and second operand as Ry that is to
be executed in a dynamically scheduled processor that follows Tomasulo’s algorithm. When
the instruction is issued, the seven-tuple entry {Op, Qj, Qk, Vj, Vk, A, Busy} in the
reservation station for this instruction is {ADD, 2, 0, 0, 2, A, 1}. Which of the following is TRUE?
Since Qj = 2, the first operand (Rx) is not yet available and will come from the output of functional unit #2. Since Qk = 0, the second operand is already valid, with Vk = 2; i.e., the Ry value is 2.
8. Suppose a load and a store access the same address. If in program order the store appears
before the load, interchanging them in execution order can create ____ hazard.
a. RAW
b. WAR
c. WAW
d. No
RAW hazard occurs when a read instruction depends on the result of a preceding write instruction.
If the load instruction is executed after the store instruction, it might read the old value from
memory instead of the updated value written by the store.
9. Consider an instruction pipeline with an issue width of 1 that uses Tomasulo's algorithm
with one reservation station per functional unit. There is one Integer MUL unit, one Integer
DIV unit, and one Integer ADD unit, all connected to a single CDB. The functional units
are not pipelined. An instruction waiting for data on CDB can move to its EX stage in the
cycle after the CDB broadcast. The instructions are:
i. I1: ADDI R1, R1, #8
ii. I2: DIV R3, R2, R1
iii. I3: MUL R4, R1, R3
iv. I4: DIV R5, R4, R1
Assume the following information about functional units.
Answer: 17
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 5: Detailed Solution
Indian Institute of Technology Guwahati
1. In a typical GPU kernel execution, which of the following statements is/are FALSE?
a. Threads of the same block can share data.
b. Data transfer from device to host memory happens after GPU kernel executes.
c. GPU threads can access contents of host memory directly.
d. Data transfer from host to device memory happens before GPU kernel executes.
Threads of the same block can share data. - True.
Threads within the same block can communicate and share data using shared memory.
Data transfer from device to host memory happens after GPU kernel executes. -
True.
Data is typically transferred from the GPU device memory to the host memory after
the GPU kernel execution is complete to retrieve results.
GPU threads can access contents of host memory directly. - False.
GPU threads cannot directly access host memory; data must be transferred to GPU
memory first.
Data transfer from host to device memory happens before GPU kernel executes.
- True.
Data is transferred from host to GPU device memory before the GPU kernel executes. Hence, the FALSE statement is c.
2. Which one of the following is TRUE with respect to hyper-threading?
b. Hyper-threading issues instructions from more than one instruction stream per issue slot.
Hyper-Threading (HT) enables a single CPU core to switch rapidly between different
threads, allowing it to issue instructions from more than one instruction stream in a
given clock cycle. This improves resource utilization and can enhance performance in
multi-threaded scenarios.
3. Which one of the following is FALSE with respect to a superscalar processor?
a. CPI will be ideally less than 1.
b. It can support multiple instruction issue per clock cycle.
c. There will be multiple functional units, but only one of them can be busy at any given point in time.
d. There is operational support for fetching more than one instruction per clock
cycle.
In a superscalar processor, it is not true that only one functional unit can be busy at
any given time. Superscalar processors are designed to have multiple functional units
that can work simultaneously to execute multiple instructions in parallel.
4. Which one of the following processors executes instruction bundles created by a compiler that has exploited the parallelism in the code?
a. Scalar processors
b. VLIW processors
c. Speculative processors
d. SIMD processors
VLIW (Very Long Instruction Word) processors execute instruction bundles that have been explicitly scheduled by the compiler to take advantage of parallelism in the code. This shifts the burden of extracting parallelism to the compiler rather than relying on complex hardware mechanisms like in superscalar processors.
5. Which execution model is used in a GPU, where each thread executes the same code on different data?
e. SIMD
f. SISD
g. MIMD
h. MISD
In a SIMD (Single Instruction, Multiple Data) execution model, a single instruction is
executed simultaneously by multiple threads (or processing elements) on different
data elements. This approach is well-suited for tasks that involve applying the same
operation to multiple data items in parallel, which is a common scenario in graphics
processing on GPUs.
6. In a GPU, which one of the following statements is TRUE with respect to memory coalescing?
a. Maximum throughput happens when threads in adjacent warps access same
cache line at a time.
b. Maximum throughput happens when threads in a warp access same cache line
at a time.
c. Maximum throughput happens when all threads in a warp access adjacent rows in memory at a time.
d. Maximum throughput happens when threads in a warp access adjacent cache lines at a time.
Memory coalescing in a GPU aims to minimize memory access latency and maximize
memory throughput by ensuring that threads in a warp access memory locations in a
contiguous and aligned manner. Accessing the same cache line at a time by threads in
a warp allows for efficient memory transactions and an improved data transfer rate.
process the display. What is the minimum number of such processors required to
Total instructions per frame = Number of pixels × Instructions per pixel = 1600 ×
1000 × 50 = 80,000,000
Total instructions per second = Total instructions per frame × Refresh rate =
8. Given an image A represented as a 12x12 pixel matrix. An operation is done on A
by a GPU that is using 2D blocks having 4 threads per block. Consider a pixel P
whose blockIdx.x=3, blockIdx.y=2, threadIdx.x=0, and threadIdx.y=1. If the image A
is stored in row major format in memory from location A[0] to A[143], what is the
index of P in the array?
Ans: 66
blockDim.x = blockDim.y = 2 (2 × 2 = 4, as there are 4 threads per block)
Row = blockIdx.y * blockDim.y + threadIdx.y = 2*2+1 = 5
Col = blockIdx.x * blockDim.x + threadIdx.x = 3*2+0 = 6
Therefore, Index = Row*12 + Col = 5*12+6 = 66
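The index computation mirrors the standard CUDA global-index formula (a sketch; the 2×2 blockDim is inferred from the 4-threads-per-block statement in the question):

```python
# Row-major index of a pixel from CUDA-style block/thread coordinates.
def pixel_index(block_idx, thread_idx, block_dim, width):
    row = block_idx[1] * block_dim[1] + thread_idx[1]   # y components
    col = block_idx[0] * block_dim[0] + thread_idx[0]   # x components
    return row * width + col

# blockIdx=(3,2), threadIdx=(0,1), blockDim=(2,2), 12-pixel-wide image
print(pixel_index((3, 2), (0, 1), (2, 2), 12))   # 66
```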
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 6: Detailed Solution
Indian Institute of Technology Guwahati
1. Which one of the following statements is TRUE with respect to an m-way set-associative cache memory organization?
a. Every cache block in a set will have a tag field.
b. There is only one tag field for each set.
c. Tag comparison happens sequentially from way 0 to way m-1.
d. Cache hit time is dependent on the way in which tag matching happens.
In an m-way set-associative cache, each set consists of m cache blocks, and each
block in the set will have a corresponding tag field. This allows the cache to store
multiple blocks in each set and perform parallel tag comparisons when looking up
data in the cache. The tag fields are used to determine whether the requested data is
present in the cache (cache hit) or not (cache miss).
2. The word length of the processor is 16 bits. The address of the first byte of a word in a
byte addressable 1 MB physical memory is 0xAB8F2. This word upon bringing to the
cache is mapped to set 30. How many words can be accommodated in each cache
block?
a. 4
b. 8
c. 16
d. 32
Solution: In binary, set 30 is 11110. The 20-bit address 0xAB8F2 ends in 1111 0010; the bits 11110 match set 30 when the block offset field is the lowest 3 bits (offset = 010). A 3-bit offset gives a block size of 2^3 = 8 bytes, and with a 16-bit (2-byte) word length, each block holds 8/2 = 2^2 = 4 words. Answer: a. 4
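The offset-width reasoning can be checked by brute force (a sketch; it assumes a 32-set cache so the set-index field is 5 bits, the smallest power-of-two index that can contain set 30):

```python
# Find the block-offset width for which address 0xAB8F2 lands in set 30,
# assuming the 5-bit set index sits directly above the block offset.
ADDR, NUM_SETS, WORD_BYTES = 0xAB8F2, 32, 2

for offset_bits in range(1, 8):
    if (ADDR >> offset_bits) % NUM_SETS == 30:
        block_bytes = 1 << offset_bits
        words_per_block = block_bytes // WORD_BYTES
        print(offset_bits, block_bytes, words_per_block)   # 3 8 4
        break
```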
3. Consider a system with 8 KB direct mapped data cache with a block size of 64 bytes.
The system has a physical address space of 64 KB with a word length of 16 bits. How
many bits are required to represent the tag field in a cache block?
a. 7 bits
b. 5 bits
c. 6 bits
d. 3 bits
#of sets = cache size / (block size × #ways per set) = 2^13 / (2^6 × 1) = 2^7 = 128 sets.
#bits representing set index = 7 bits
64 KB physical address space → 16-bit physical address
Tag = 16 - 7 - 6 = 3 bits, Index = 7 bits, Offset = 6 bits
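The same split in code form (a sketch using the parameters from the question):

```python
# Tag/index/offset widths for an 8 KB direct-mapped cache with 64-byte
# blocks on a 16-bit physical address.
import math

cache_bytes, block_bytes, ways, addr_bits = 8 * 1024, 64, 1, 16

offset_bits = int(math.log2(block_bytes))                          # 6
index_bits  = int(math.log2(cache_bytes // (block_bytes * ways)))  # 7
tag_bits    = addr_bits - index_bits - offset_bits                 # 3
print(tag_bits, index_bits, offset_bits)   # 3 7 6
```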
4. Which one of the following statements is TRUE for a write miss in no write allocate
caches?
a. The block containing the missed word is brought to the cache for writing and
c. Write the missed word in the main memory and then immediately bring the
modified block to the cache.
d. The block containing the missed word is brought to the cache for writing and
In no write allocate caches, a write miss does not allocate a block for the missed word. Instead, the data is written directly to the main memory without bringing the block into the cache. This approach avoids unnecessary cache block allocations for write operations and ensures that the main memory always contains the most up-to-date data.
5. When a processor requests data from memory, the cache is checked first. Upon
encountering a miss, the cache is loaded first from memory and then the processor is
6. How many conflict misses are encountered when FIFO cache block replacement
technique is used with a 4-way set associative cache for the following block access
pattern? Assume initially the cache is empty.
P, Q, R, S, T, P, Q, S, R, T, Q, P
a. 1
b. 3
c. 5
d. 0
Block: P  Q  R  S  T  P  Q  S  R  T  Q  P
Way-1: P  P  P  P  T  T  T  T  T  T  T  T
Way-2: -  Q  Q  Q  Q  P  P  P  P  P  P  P
Way-3: -  -  R  R  R  R  Q  Q  Q  Q  Q  Q
Way-4: -  -  -  S  S  S  S  S  R  R  R  R
Therefore, the total misses are 8 (5 compulsory misses and 3 conflict misses), so the number of conflict misses is 3.
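The miss count can be confirmed with a short FIFO simulation of one 4-way set (a sketch):

```python
# FIFO replacement in a single 4-way set; counts total misses.
from collections import deque

def fifo_misses(pattern, ways=4):
    resident, misses = deque(), 0
    for block in pattern:
        if block not in resident:
            misses += 1
            if len(resident) == ways:
                resident.popleft()      # evict the oldest resident block
            resident.append(block)
    return misses

total = fifo_misses("PQRSTPQSRTQP")
compulsory = len(set("PQRSTPQSRTQP"))   # 5 distinct blocks
print(total, total - compulsory)        # 8 3
```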
7. Consider a program running on a system with a direct-mapped D-cache with a block size of 16 bytes. The program reads 4 data words A, B, C and D in that order 5 times (total 20 memory references). Let the physical addresses of A, B, C and D be 0x420424, 0x74042A, 0x740664 and 0x74066D, respectively.
c. At the end of 20 memory references, A, C and D are located inside the cache.
d. Every access to C will result in eviction of B from the cache.
#of sets = cache size / (block size × #ways per set) = 2^12 / (2^4 × 1) = 2^8 = 256 sets.
16 MB main memory → 24-bit physical address
Mapping:
A → 0x420424 Tag=0x420, Set Index: 0x42
B → 0x74042A Tag=0x740, Set Index: 0x42
C → 0x740664 Tag=0x740, Set Index: 0x66
D → 0x74066D Tag=0x740, Set Index: 0x66
A and B map to the same set, but they belong to different blocks (their tags differ). Since the cache is direct mapped, A and B will have conflict misses between them, as one will evict the other. C and D have the same tag and set index, so they are part of the same block and will coexist in that block without any conflict. A, B, and C will have compulsory misses; D will not have a compulsory miss, as bringing C automatically brings D as well. Hence every access to D will be a hit. Each access to A and B will be a miss and the first access to C will be a miss, for a total of 5 + 5 + 1 = 11 misses. At the end, B, C and D will be in the cache.
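The address breakdown and miss count above can be checked with a small Python sketch (a verification aid, not part of the original solution), assuming the 4 KB direct-mapped cache with 16 B blocks and 24-bit addresses derived above:

```python
def decompose(addr, offset_bits=4, set_bits=8):
    """Split a 24-bit address into (tag, set index, offset) for a
    4 KB direct-mapped cache with 16 B blocks (2^8 sets)."""
    offset = addr & ((1 << offset_bits) - 1)
    set_idx = (addr >> offset_bits) & ((1 << set_bits) - 1)
    tag = addr >> (offset_bits + set_bits)
    return tag, set_idx, offset

A, B, C, D = 0x420424, 0x74042A, 0x740664, 0x74066D

# Direct-mapped: each set holds exactly one tag; replay A,B,C,D five times.
cache = {}
misses = 0
for addr in [A, B, C, D] * 5:
    tag, set_idx, _ = decompose(addr)
    if cache.get(set_idx) != tag:
        misses += 1
        cache[set_idx] = tag
print(misses)   # 11
```

The replay reproduces the 11 misses: A and B thrash in set 0x42 every round, while C and D share one block in set 0x66.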
Total instructions per second = Total instructions per frame × Refresh rate =
1000 × 50 = 80,000,000
Pseudo LRU block replacement technique. Assume that set n is initially empty. What will be the contents of set n (in the order way0, way1, way2 and way3) after servicing all the requests? [Assume that data is entered into an empty cache block in way-0, way-1, way-2 & way-3 order]
a. ECDA
b. ECAD
c. EDCA
d. BEDC
gy
lo
no
ch
Te
of
te
itu
st
In
an
di
In
Total access = 13
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 6: Detailed Solution
Indian Institute of Technology Guwahati
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 7: Detailed Solution
Indian Institute of Technology Guwahati
b. a decrease in compulsory misses and conflict misses.
c. an increase in compulsory misses and conflict misses.
d. a decrease in compulsory misses and increase in conflict misses.
Increasing block size generally leads to:
A decrease in compulsory misses: larger blocks fetch more data per miss, reducing the frequency of fetching new blocks from main memory.
An increase in conflict misses: with larger blocks there are fewer blocks in the cache, so more memory addresses map to the same cache set, causing cache conflicts and evictions.
2. Which one of the following optimizations reduces the cache miss penalty?
a. Pipelined caching
b. Multi-level caching
c. Way prediction
d. Multibanked caching
Multi-level caching reduces the miss penalty: a miss in the L1 cache can often be serviced by the L2 (or L3) cache, which is much faster than accessing data directly from main memory. As you move down the cache hierarchy, the cache sizes typically increase, providing a balance between low-latency access for frequently used data and larger capacity for less frequently used data.
b. Early restart and critical word first techniques reduce miss penalty.
c. Hardware prefetching reduces cache hit time.
d. Non-blocking cache results in increased cache bandwidth.
Hardware prefetching is a technique used to anticipate future memory accesses and fetch the required data into the cache before it is actually needed. While prefetching can help reduce cache miss penalties by ensuring that the data is already in the cache when needed, it does not directly reduce the cache hit time. Prefetching aims to mitigate miss penalties rather than accelerate cache hits.
4. Which one of the following statements is TRUE?
a. Way prediction technique reduces miss penalty in caches.
The conflict miss rate is higher in a direct-mapped cache than in a set-associative cache of similar configuration, as a set-associative cache can accommodate a number of blocks per set.
potentially allowing for faster cache operation and data throughput. - TRUE
5. The average memory access time for a memory hierarchy system with one level of
cache and a main memory is 6 ns. The hit time and miss penalty of the cache is 2 ns
and 100 ns, respectively. The hit rate of the cache (round off to two decimal places) is
a. 0.94
b. 0.96
c. 0.02
d. 0.04
AMAT = 6 ns
Hit Time = 2 ns
Miss Penalty = 100 ns
6 ns = 2 ns + (Miss Rate × 100 ns)
Miss Rate × 100 ns = 6 ns - 2 ns = 4 ns
Miss Rate = 0.04
Since hit rate + miss rate = 1:
Hit Rate = 1 - Miss Rate = 1 - 0.04 = 0.96
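The rearrangement can be expressed as a tiny Python sketch (a verification aid, not part of the original solution):

```python
# AMAT = hit_time + miss_rate * miss_penalty; rearrange to recover the hit rate.
amat, hit_time, miss_penalty = 6.0, 2.0, 100.0

miss_rate = (amat - hit_time) / miss_penalty   # (6 - 2) / 100 = 0.04
hit_rate = 1 - miss_rate                       # 0.96
print(miss_rate, hit_rate)
```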
6. Assume an L1 cache with a hit rate of 85%, and an L2 cache with a local miss rate of 4%. If there are 1500 memory accesses initiated by the CPU, then the number of memory accesses that hit in the L2 cache is ______.
Ans: 216
Number of accesses that will miss in L1 cache = Miss rate_L1 × # memory accesses to L1 = 0.15 × 1500 = 225
Number of accesses that hit in L2 = 225 × (1 - 0.04) = 216
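The two-level accounting can be replayed in Python (a verification sketch, not part of the original solution):

```python
accesses = 1500
l1_hit_rate = 0.85          # 85% of accesses hit in L1
l2_local_miss_rate = 0.04   # 4% of the accesses that reach L2 miss there

l1_misses = round(accesses * (1 - l1_hit_rate))        # forwarded to L2
l2_hits = round(l1_misses * (1 - l2_local_miss_rate))  # served by L2
l2_misses = round(l1_misses * l2_local_miss_rate)      # go to main memory
print(l1_misses, l2_hits, l2_misses)   # 225 216 9
```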
7. A cache has a hit time of 10 ns and hit rate of 60%. An optimization was made to increase the hit rate to 70% but the hit time was increased to 15 ns. The optimization resulted in a 10% reduction in average memory access time. Assume that the miss penalty is unaffected by the optimization. The miss penalty of the cache (in ns) is _____.
AMAT_opt / AMAT_old = 0.9
0.9 = (15 + 0.3x) / (10 + 0.4x)
9 + 0.36x = 15 + 0.3x
0.06x = 6
x = 100
Miss Penalty = 100 ns
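The same equation can be solved exactly in Python using rational arithmetic (a verification sketch, not part of the original solution):

```python
from fractions import Fraction as F

# Solve 0.9 = (15 + 0.3x) / (10 + 0.4x) for the miss penalty x.
ratio = F(9, 10)
new_hit, new_mr = 15, F(3, 10)   # optimized cache: 15 ns hit, 30% miss rate
old_hit, old_mr = 10, F(4, 10)   # original cache: 10 ns hit, 40% miss rate

# ratio * (old_hit + old_mr*x) = new_hit + new_mr*x
x = (new_hit - ratio * old_hit) / (ratio * old_mr - new_mr)
print(x)   # 100
```

Using `Fraction` avoids floating-point rounding noise in the 0.36x - 0.3x subtraction.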
8. A 32-bit word processor is connected to a 16 KB, 4-way set-associative L1 cache having a block size of 32 B. The total physical address space is 256 MB. When an L1 cache miss occurs, it takes 25 cycles to fetch the first word of a block from the L2 cache and 4 cycles for each subsequent word in the block. Assume that the processor is stalled due to an L1 cache miss on a word whose first byte address is 0x3416ACC. Assume that the word is a hit in the L2 cache. How many cycles will the processor stall before it resumes execution if an early restart optimization is done on the L1 cache?
Ans: 37 [range 37 to 37]
1 word = 4 B
Block size = 32 B → # words/block = 32/4 = 8 → 3 word-index bits
0x3416ACC → 0011 0100 0001 0110 1010 1100 1100
Byte-in-block (lowest 5 bits) = 01100 = 12 → word index = 12/4 = 3 (the 4th word)
With early restart, stall = 25 + 3 × 4 = 37 cycles
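The early-restart stall can be computed directly in Python (a verification sketch, not part of the original solution):

```python
addr = 0x3416ACC
block_size, word_size = 32, 4
words_per_block = block_size // word_size   # 8 words -> 3 index bits

byte_in_block = addr % block_size           # low 5 bits = 12
word_index = byte_in_block // word_size     # 12 // 4 = 3 (4th word)

# Early restart: the CPU resumes as soon as the requested word arrives,
# so only the words up to and including it contribute to the stall.
stall = 25 + word_index * 4
print(word_index, stall)   # 3 37
```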
b. Memory consistency provides local ordering of accesses to all words in a cache block.
c. Cache coherence provides local ordering of accesses to each cache block.
d. Write serialization ensures that writes to the same location are globally ordered.
"Memory consistency provides local ordering of accesses to all words in a cache block" is false. Memory consistency primarily focuses on the order in which memory operations are perceived by different threads, not on the local ordering of accesses to specific words within a cache block.
2. What is the purpose of Write Propagation in memory consistency and cache coherence?
a. It delays write operations to ensure coherence.
b. It ensures that the value written in one cache is propagated to at least one sharer in a predetermined order.
Write propagation ensures that when a write operation is performed, the modified data is eventually made visible to all threads in a multi-threaded system. This is essential for maintaining memory consistency and ensuring that all threads have a coherent and up-to-date view of the data. Write propagation helps prevent data inconsistencies and synchronization issues in multi-threaded programs.
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 8: Detailed Solution
Indian Institute of Technology Guwahati
3. Which one of the following protocols ensures that a cache controller sends broadcast messages in a common medium for other cache controllers connected to it for taking appropriate cache coherence operations?
a. Write serialization protocols
b. Snoopy protocols
c. Directory based protocols
d. Consistency protocols
Snoopy protocols are a class of cache coherence protocols in which each cache controller monitors or "snoops" the common communication medium (usually a shared bus) for transactions initiated by other cache controllers. When a cache controller observes a transaction that may affect its own cache's data, it takes appropriate cache coherence actions, such as invalidating or updating its cache line to maintain data consistency. These protocols rely on broadcast messages to keep caches coherent and are efficient for smaller-scale systems.
of B to P1, what will be the state transition done by P2 on the cache block B?
The Modified-to-Shared transition occurs when another processor requests the same data (encounters a read miss), and the processor that had it in the Modified state has to relinquish its exclusive ownership. Changing the state to Shared means that multiple processors can have the data in their caches in a non-exclusive manner.
5. Which of the following are the advantages of using a directory based cache coherence
protocol over a snooping based cache coherence protocol? [Multiple correct
answers]
A. Reduced cache size requirements for a given workload
B. Less contention in accessing the directory
C. Elimination of broadcast messages
D. Scalability of processors attached to the interconnect
Less contention in accessing the directory: directory-based protocols have lower contention for shared resources, improving system performance.
Scalability of processors attached to the interconnect: they are better suited for large multiprocessor systems where snooping-based protocols may become inefficient due to increased complexity and bus congestion.
Reduced cache size requirements is not a typical advantage of directory-based protocols.
Elimination of broadcast messages is not entirely true, as some messaging or directory access is still needed in directory-based protocols.
6. If two co-operating processors P1 and P2 write to two different words W1 and W2, respectively, of a cache block B, and the system uses a directory cache coherence protocol, which of the following statements is/are TRUE? [Multiple correct answers]
There exists false sharing because P1 and P2 are accessing different words within the same cache block. Block B can keep bouncing between P1 and P2 due to continuous state transitions caused by interleaved accesses. This is known as cache line bouncing or thrashing and can result in inefficient use of system resources.
7. Consider a directory based coherence system for a 256 GB physical address space. A 16-core processor is connected to this physical address space. Each core has a 128 KB, 4-way set-associative private cache memory of block size 256 bytes. Assume that the central directory will store information of the most frequently used 1024 cache blocks only. Each directory entry will store the state (2 bits to represent one of the 3 states: E, U and S), block number and list of sharers (1-bit per core). What is the storage space consumed by the directory in bytes? [2 marks]
Ans: 6144
Solution: # blocks in this system = 2^38 / 2^8 = 2^30
Bits in each directory entry: 2 (state) + 30 (block number) + 16 (1-bit per core) = 48
Each directory entry is 48 bits → 6 bytes.
The directory stores information about the 1024 most frequently used cache blocks, so total storage = 1024 × 6 = 6144 bytes.
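The directory sizing can be replayed in Python (a verification sketch, not part of the original solution):

```python
phys_addr_bits = 38       # 256 GB physical address space
block_offset_bits = 8     # 256 B blocks
cores = 16
entries = 1024            # most frequently used blocks tracked

block_number_bits = phys_addr_bits - block_offset_bits   # 30
entry_bits = 2 + block_number_bits + cores               # state + block + sharer vector
entry_bytes = entry_bits // 8                            # 48 bits -> 6 bytes
print(entry_bits, entries * entry_bytes)   # 48 6144
```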
8. Consider a multi-processing system with two cores A and B with their own private caches and a single shared main memory using the MESI cache coherence protocol. The cores execute the following instructions:
1. LW R1, M1
2. LW R2, M2
3. SW R3, M2
4. SW R2, M1
The addresses pointed by M1 and M2 map to different cache blocks. Consider the following execution sequence in the format Core-Instruction Number: A-1, A-2, B-2, B-1, B-3, A-3, B-4, A-4. If the state of the 4 blocks (M1 in A, M2 in A, M1 in B, M2 in B) is initially Invalid (I), what will be the states of these blocks after the execution of the above 8 instruction sequence? [2 marks]
a. MMII
b. SSSS
c. MSIS
d. ISMI
Sequence:
A-1: LW R1, M1
A-2: LW R2, M2
B-2: LW R2, M2
B-1: LW R1, M1
B-3: SW R3, M2
A-3: SW R3, M2
B-4: SW R2, M1
A-4: SW R2, M1
Core-Ins   M1 in A   M2 in A   M1 in B   M2 in B
initial       I         I         I         I
A-1           E         I         I         I
A-2           E         E         I         I
B-2           E         S         I         S
B-1           S         S         S         S
B-3           S         I         S         M
A-3           S         M         S         I
B-4           I         M         M         I
A-4           M         M         I         I
Answer: MMII
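The state table can be reproduced with a simplified two-core MESI model in Python. This is a verification sketch, not part of the original solution; it models only the read/write transitions needed here and elides write-backs:

```python
# Simplified two-core MESI model: one state per (core, block) pair.
state = {("A", "M1"): "I", ("A", "M2"): "I",
         ("B", "M1"): "I", ("B", "M2"): "I"}

def other(core):
    return "B" if core == "A" else "A"

def read(core, blk):
    if state[(core, blk)] == "I":            # read miss
        if state[(other(core), blk)] != "I":
            state[(other(core), blk)] = "S"  # owner/sharer downgrades to Shared
            state[(core, blk)] = "S"
        else:
            state[(core, blk)] = "E"         # sole copy: Exclusive
    # hit in S/E/M: no transition

def write(core, blk):
    state[(core, blk)] = "M"                 # writer gains Modified
    state[(other(core), blk)] = "I"          # invalidate the other copy

# Execution order A-1, A-2, B-2, B-1, B-3, A-3, B-4, A-4
for core, op, blk in [("A", "r", "M1"), ("A", "r", "M2"), ("B", "r", "M2"),
                      ("B", "r", "M1"), ("B", "w", "M2"), ("A", "w", "M2"),
                      ("B", "w", "M1"), ("A", "w", "M1")]:
    (read if op == "r" else write)(core, blk)

final = "".join(state[k] for k in [("A", "M1"), ("A", "M2"),
                                   ("B", "M1"), ("B", "M2")])
print(final)   # MMII
```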
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 9: Detailed Solution
Indian Institute of Technology Guwahati
1. Which component of access time in a Hard Disk dominates for very short seeks (2-4 cylinders)?
a. Settle time
b. Coast time
c. Speedup time
d. Slowdown time
During short seeks, where the heads move only a small distance, the time it takes for the heads to settle in the correct position can be a significant portion of the total access time. This is because the heads need to overcome mechanical inertia and vibrations to accurately position themselves over the desired data track.
2. In a DRAM system that follows an open row buffer management policy, which of the following sequences of commands is generated if the new request is to a different row than the currently open row?
a. Activate
b. Precharge followed by CAS
In a DRAM system with open row buffer management, when accessing a different memory row than the last one, the following sequence occurs:
Precharge: the current row is reset to prepare for the next access.
Activate: the desired memory row is selected and its data is copied into the row buffer.
CAS (Column Access Strobe): the specific data from the row buffer is accessed.
3. Which of the following is NOT a function of the DRAM controller?
a. Translate memory requests to DRAM command sequences.
b. Manage power consumption and thermals in DRAM.
c. Buffer and schedule incoming memory requests.
d. Reorganizing the stored data in DRAM for better space utilization
DRAM controllers primarily handle tasks such as translating memory requests to DRAM command sequences, managing power consumption and thermals in DRAM (to some extent), and buffering/scheduling incoming memory requests. Reorganizing data for better space utilization is typically a function of higher-level memory management or file systems and is not a direct responsibility of the DRAM controller.
Disk scheduling algorithms aim to reduce the time it takes for the read/write heads of a hard disk drive to seek (move) from their current position to the desired track or cylinder where data needs to be read or written. Minimizing seek time helps improve overall disk access performance.
6. A 64 GB DRAM system that uses 4 channels (C0, C1, C2 and C3) has 2048 columns per row. It uses a 64-bit wide memory bus to transfer data from DRAM to the processor. If adjacent memory words are mapped on to adjacent memory channels, which channel will fetch the physical address 0x2953A1B5C?
a. C0
b. C1
c. C2
d. C3
64 GB DRAM → 36-bit physical address.
If adjacent memory words are mapped on to adjacent memory channels, then the channel bits will be just before the last 3 bits (byte within bus).
The address split-up is as follows:
Rest (rank + row): 20 bits | Column: 11 bits | Channel: 2 bits | Byte within bus: 3 bits
The low byte of 0x2953A1B5C is 0x5C = 0101 1100; the channel bits (bits 4:3) are 11 → channel C3.
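The channel extraction can be checked with a couple of lines of Python (a verification sketch, not part of the original solution):

```python
addr = 0x2953A1B5C
byte_bits = 3       # 64-bit (8 B) bus -> 3 byte-offset bits
channel_bits = 2    # 4 channels -> 2 channel bits just above the byte bits

channel = (addr >> byte_bits) & ((1 << channel_bits) - 1)
print(channel)   # 3, i.e. channel C3
```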
7. A 4 GB hard disk that has only 1 magnetic surface for storing data has 256 cylinders and there are 128 sectors per track. If all sectors/cylinders are storing the same amount of data, the maximum size of a file that occupies 8 sectors of a cylinder in KB is ______.
Ans: 1024
Size of hard disk = 4 GB = 2^32 bytes
Sector size = 2^32 / (256 × 128) = 2^32 / 2^15 = 2^17 bytes = 128 KB
File size = 8 × 2^17 bytes = 2^20 bytes = 1024 KB
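The sector arithmetic can be replayed in Python (a verification sketch, not part of the original solution); with a single surface, each cylinder contributes exactly one track:

```python
disk_bytes = 4 * 2**30                    # 4 GB
cylinders, sectors_per_track = 256, 128   # 1 surface -> 1 track per cylinder

sector_bytes = disk_bytes // (cylinders * sectors_per_track)  # 2^17 B = 128 KB
file_kb = 8 * sector_bytes // 1024                            # 8 sectors, in KB
print(sector_bytes, file_kb)   # 131072 1024
```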
8. A disk drive has 200 cylinders numbered from 0 to 199. The disk arm is initially positioned at cylinder 50. There are now five pending disk requests (cylinder numbers) in the queue: 72, 55, 40, 90, 5. Calculate the total head movements to service all these requests using the SSTF disk scheduling algorithm.
Answer: 155
The service order is 50, 55, 40, 72, 90, 5.
Move from cylinder 50 to cylinder 55: 55 - 50 = 5 movements
Move from cylinder 55 to cylinder 40: 55 - 40 = 15 movements
Move from cylinder 40 to cylinder 72: 72 - 40 = 32 movements
Move from cylinder 72 to cylinder 90: 90 - 72 = 18 movements
Move from cylinder 90 to cylinder 5: 90 - 5 = 85 movements
Total = 5 + 15 + 32 + 18 + 85 = 155
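SSTF can be simulated directly in Python (a verification sketch, not part of the original solution):

```python
def sstf(start, requests):
    """Service requests shortest-seek-time-first; return (order, total moves)."""
    pos, pending = start, list(requests)
    order, total = [start], 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))  # closest cylinder next
        total += abs(nxt - pos)
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order, total

order, moves = sstf(50, [72, 55, 40, 90, 5])
print(order, moves)   # [50, 55, 40, 72, 90, 5] 155
```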
9. Consider a 1 MB DRAM on a single DIMM with two ranks, 16 banks (named B0, B1, B2, ..., B15) per rank and 32 columns per row. The data bus width is 16 bytes. The addressing uses row interleaving. Which of the following physical addresses is
b. 0x55587
c. 0x65B24
d. 0x578B5
In row interleaving the bank bits are between the column and row bits.
The address split-up is as follows:
Rank + Row: 7 bits | Bank: 4 bits | Column: 5 bits | Byte within bus: 4 bits
1. What is the basic unit of flow control between a pair of adjacent routers in an NoC?
a. Crossbar
b. Packet
c. Buffer
d. Flit
The basic unit of flow control between a pair of adjacent routers in a Network-on-Chip (NoC) is typically a "flit". Flits are smaller than traditional packets and are used to control the flow of data through the NoC by dividing data into smaller, manageable units. They help in routing data efficiently and managing congestion in on-chip networks.
2. What does the term topology specify in on-chip networks?
a. The way routers are connected
b. The routing algorithm used
The topology specifies the way routers are connected, acting as a blueprint for how data travels within the chip. Different topologies have different trade-offs in latency, bandwidth, and cost.
3. When we use source routing in on-chip networks, where is the routing information stored?
In on-chip networks that use source routing, the routing information is typically stored in the "packet header." Each packet contains the necessary information to specify the route it should take through the network. This information is placed in the header of the packet, and routers use it to determine the path the packet should follow to reach its destination.
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 10: Detailed Solution
Indian Institute of Technology Guwahati
4. The following functions are performed by an NoC router to forward a packet received in its input port to an appropriate output port.
i. Route Computation
ii. Buffering of Flits
iii. Switch Allocation
iv. VC Allocation
v. Switch Traversal
vi. Link Traversal
Which of the following is the correct sequence in which the functions are performed?
a. i-ii-iii-iv-v-vi
b. ii-i-iv-iii-v-vi
c. ii-i-iii-iv-v-vi
d. v-ii-i-iv-iii-vi
The correct sequence in which the functions are performed by an NoC router to forward a packet received in its input port to an appropriate output port is:
ii - Buffering of Flits (incoming flits are buffered to wait for further processing)
i - Route Computation (the router determines the route the packet should take)
iv - VC Allocation (virtual channels are allocated for the packet)
iii - Switch Allocation (the router decides which output port to forward the packet to)
v - Switch Traversal (the packet is sent through the switch or crossbar to the output port)
vi - Link Traversal (the packet travels across the link to the next router or destination)
5. In an NoC router, if an incoming packet has more than one potential output port possible as per the adaptive routing algorithm, one output port is finally chosen by ____
a. input selection strategy
b. spatial scheduling
c. output selection strategy
d. VC allocation
One output port is finally chosen by the "output selection strategy" if an incoming packet in an NoC router has more than one potential output port possible according to the adaptive routing algorithm. The output selection strategy determines which specific output port the packet should be forwarded to based on factors such as congestion, availability, and other routing criteria.
6. Which one of the following statements is FALSE?
a. XY routing is minimal and always deadlock free.
b. Odd even routing is adaptive but not deadlock free.
c. East first routing can be non-minimal.
d. North last routing is deadlock free.
7. In a 6x6, 2D-mesh network on chip, the number of routers in the network that are directly connected to four neighboring routers is ______.
Ans: 16
1   2   3   4
5   6   7   8
9   10  11  12
13  14  15  16
In a 6x6 2D-mesh network on chip, all routers except the boundary routers are directly connected to their four neighboring routers (north, south, east, and west) within the mesh. These interior routers form the 4x4 grid numbered 1-16 above.
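The count of interior routers can be verified in Python (a verification sketch, not part of the original solution):

```python
n = 6
# A router has all four neighbours (N, S, E, W) exactly when it is not on
# any edge of the mesh, i.e. its row and column are strictly interior.
internal = [(r, c) for r in range(n) for c in range(n)
            if 0 < r < n - 1 and 0 < c < n - 1]
print(len(internal))   # 16
```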
8. Consider a 64-tile system that uses a square mesh NoC topology where routers follow minimal odd-even routing. If the packet P1 travels from router 18 to 36, how many unique paths exist for this packet to reach its destination?
Ans: 3
56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8  9  10 11 12 13 14 15
0  1  2  3  4  5  6  7
2: 18->26->27->35->36
3: 18->26->34->35->36
Answer: 20
56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8  9  10 11 12 13 14 15
0  1  2  3  4  5  6  7
b. To store deflected flits temporarily
c. To store flits that are golden
d. To reduce port conflict
In side-buffered deflection routers, the role of side buffers is to store deflected flits temporarily. When a router experiences congestion and cannot immediately forward a flit to its intended output port, it may deflect the flit to a side buffer. The side buffer temporarily holds the deflected flits until they can be transmitted when the output port becomes available. This helps in reducing congestion and preventing packet loss in the network by allowing flits to be stored temporarily.
2. What is hot-potato routing?
Hot-potato routing is routing based on any available output port rather than waiting for an optimal or congestion-free route. It is used to reduce latency and avoid the need for packet buffering.
3. At any given point in time, the maximum number of silver flits in a 5x5 mesh NoC realized using MinBD routers is ____.
a. 1
b. 5
c. 25
d. 10
In MinBD routers there can be only 1 silver flit at each router at any point of time. Therefore, the maximum number of silver flits in a 5x5 mesh NoC realized using MinBD routers is 5 × 5 = 25.
4. Which of the following is TRUE with respect to the buffer-less deflection router CHIPPER?
a. One packet per router is identified as golden and golden packets are never deflected.
b. Ejection stage is kept after the inject stage in the router pipeline.
The permutation network is used to determine the output port for each incoming flit, and this is done in parallel to allow flits to be routed without buffers, which helps in reducing latency and complexity in the router.
buffer.
b. At most two flits can be ejected per router per cycle.
c. In the router pipeline, the buffer inject unit is kept after the buffer eject unit.
d. Once a flit becomes silver, it is no longer deflected till it reaches its destination
a. CHIPPER router uses golden packet scheme for flit prioritization
b. BLESS uses sequential port allocation logic
c. MinBD uses quadrant routing algorithm
d. DeBAR has single ejection port.
MinBD routers are typically associated with deflection routing and are used to reduce bisection congestion in network-on-chip (NoC) designs. While the specific routing algorithm in MinBD routers can vary, they usually do not use the quadrant routing algorithm. Instead, they may employ adaptive or deflection routing strategies. So, the statement regarding the use of the quadrant routing algorithm is false.
a. SLIDER
b. CHIPPER
c. DeBAR
d. MinBD
DeBAR is a specific router architecture designed for on-chip networks and uses
a. SLIDER
b. CHIPPER
c. DeBAR
d. MinBD
SLIDER uses both restrictive and non-restrictive injection techniques. Restrictive injection only allows flits to enter the network when a clear path is available, while non-restrictive injection permits flits to enter even if they may need to be deflected. This helps SLIDER manage network congestion and improve routing efficiency in on-chip networks.
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 11: Detailed Solution
Indian Institute of Technology Guwahati
The details of the packets are (Packet number, Golden, Input Port, Destination): <P1, Yes, S, 10>, <P2, No, W, 14>, <P3, No, E, 9> and <P4, No, N, 6>. A tie between two non-golden flits is resolved using the packet number: the higher the packet number, the higher the priority. How many packets get a productive output port in PDN?
Answer: 2
NPTEL Online Certification Course
Multi-Core Computer Architecture
Assignment Number - 12: Detailed Solution
Indian Institute of Technology Guwahati
1. The number of cycles a packet can be delayed in the network without reducing the application's performance is known as _____.
a. Packet latency
b. Network stall time
c. Slack
d. Critical time
The slack of a packet is defined as the number of cycles it can be delayed in the network without reducing the application's performance.
2. Intel KNL has _____ tiles in 2D mesh.
a. 8
b. 36
c. 64
d. 16
Intel KNL has 36 tiles interconnected by a 2D mesh.
3. Which one of the following emerging NoCs uses the concept of diverting light of a certain wavelength?
a. 3D NoC
b. Nanophotonics
d. Vertical interconnect
Nanophotonics uses a microring resonator which diverts light of a certain wavelength when a voltage is applied.
4. Which one of the following is a 64-bit dual core VEGA processor?
a. VEGA AS1061
b. VEGA AS2161
c. VEGA AS4161
d. VEGA AS1161
VEGA AS2161 is a 64-bit dual-core, 16-stage pipeline, out-of-order RISC-V processor. Refer to this for details: https://vegaprocessors.in/vega.php
5. Which one of the following statements is TRUE about layerwise DNN computation done on a TCMP system?
a. Filter/weights are moved from global buffer to off-chip memory
b. Output feature map is progressed from off-chip memory to global buffer
Layerwise DNN computation on a TCMP system involves the movement of filters/weights and feature maps between off-chip memory and the on-chip global buffer.
6. Consider a 16x16 Cmesh NoC structure in which each node is connected to 4 processing cores. The entire mesh structure is divided into 16 Wcubes. Each Wcube is marked with a 4-bit number [Wcube-0 (0000) to Wcube-15 (1111)]. Identify the correct Wcube with which Wcube-15 can directly communicate.
a. Wcube-9
b. Wcube-13
c. Wcube-10
d. Wcube-8
Direct communication is possible only with a Wcube that is at Hamming distance 1 (the two numbers differ at exactly one bit position). Here 15 (1111) and 13 (1101) differ in only one bit position.
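The Hamming-distance check can be done in Python (a verification sketch, not part of the original solution):

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Wcubes one hop away from Wcube-15 (1111): Hamming distance exactly 1.
neighbours = [w for w in range(16) if hamming(15, w) == 1]
print(neighbours)   # [7, 11, 13, 14]
```

Of the four single-bit neighbours of 15, only Wcube-13 appears among the options.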
7. Consider a TCMP system with 64 tiles, where each tile consists of a superscalar processor, a private L1 cache and a shared distributed L2 cache. The total L2 cache on the chip is 16 MB and L2 uses 64 B blocks and is 8-way associative. Each L2 cache slice on-chip has all the 8 ways of the sets assigned to it. The L2 cache memory per tile division is such that the total sets in the L2 cache are uniformly partitioned across tiles in a sequential fashion. The system uses a 32-bit physical address. How many L2 cache sets are mapped per tile?
Ans: 512
Total sets in L2 = 2^24 / (2^6 × 2^3) = 2^15 = 32768; sets per tile = 2^15 / 2^6 = 2^9 = 512.
8. Consider a TCMP system with a 4x4 mesh NoC where each tile consists of a superscalar processor, a private L1 cache and a shared distributed L2 cache. Let T0, T1, T2, ..., T15 correspond to the tiles where T0 is the bottom left tile and T15 the top right tile. Each tile has a 16 KB 2-way associative L1 cache with a block size of 16 B. The total L2 cache on the chip is 32 MB and L2 uses 128 B blocks and is 16-way associative. Each L2 cache bank has all the 16 ways of the sets assigned to it. The L2 cache memory per tile division is such that the total sets in the L2 cache are uniformly partitioned across all tiles in a sequential fashion. The system uses a 40-bit physical address. T4 generated an L1 cache miss for the address A1 = 0xA8CD210652. As per the L2 set mapping, tile Tx hosts the L2 set for A1. What is the value of x? (Hint: possible values of x range from 0 to 15.)
Correct answer: 0
L2 size per tile = 32 MB / 16 = 2 MB = 2^21 bytes
Number of sets in one L2 slice = 2^21 / (2^7 × 2^4) = 2^10
So the address distribution is: tag = 19, tile = 4, set index within tile = 10, byte offset = 7
0xA8CD210652 = 1010 1000 1100 1101 0010 0001 0000 0110 0101 0010
The first 7 bits from the right are offset bits, the next 10 bits are set bits, and the next 4 bits are the tile number bits. The rest of the bits are tag bits. Extracting the tile bits:
Tile bits: 0000 = 0
So the miss request travels from T4 to T0 through the NoC.
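The tile-bit extraction can be checked in Python (a verification sketch, not part of the original solution), using the field widths derived above:

```python
addr = 0xA8CD210652
offset_bits, set_bits, tile_bits = 7, 10, 4   # 40-bit address: tag 19 | tile 4 | set 10 | offset 7

# Tile bits sit just above the set-index bits.
tile = (addr >> (offset_bits + set_bits)) & ((1 << tile_bits) - 1)
print(tile)   # 0, i.e. tile T0
```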