Computer Architecture (Syllabus, Assignment 1 and 2, Lessons 2 and 3)

This document contains an assignment for a computer architecture course. It includes 11 multiple part questions about computer architecture topics like RISC-V assembly code, CPI, MIPS, and performance analysis. Students are asked to write RISC-V assembly code corresponding to C statements, analyze processor performance based on clock rate and CPI, calculate execution times, and determine which of two processors or implementations would perform better for a given task. The assignment is submitted as a PDF and includes definitions of key terms and explanations of answers.


COSE222: Computer Architecture

Assignment #1
Due: Oct 6, 2021 (Wednesday) 11:59pm on Blackboard
Total points: 125

Please answer the following questions. Write your student ID and name at the top of the document.
Submit your homework in “PDF” format only. (You can easily generate PDF files from Microsoft
Word or Hangul Word Processor (HWP). You can also handwrite your answers and scan the handwritten
pages to PDF; document capture applications such as “Microsoft Lens” can scan your documents
with your smartphone.)
The answer rules:
(1) You may write your answers in either Korean or English.
(2) Round your final numeric answers to two decimal places.
(3) The performance of A is improved by NN % compared to the performance of B if PerfA / PerfB = 1.NN.

1. Define the following terms.


(a) Moore’s Law

(b) Dennard Scaling [4]

(c) Abstraction [4]

2. Write the eight great ideas in computer architecture research mentioned during the class. [8]

3. Describe the steps that transform a program written in a high-level language such as C into a
representation that is directly executed by a computer processor. [4]

This study source was downloaded by 100000855754497 from CourseHero.com on 04-21-2023 00:58:46 GMT -05:00

https://www.coursehero.com/file/113831275/COSE222-HW1pdf/
4. Computer A can execute a C program 10 times in one second and Computer B can execute the same C
program 20 times in one second. If the MIPS (million instructions per second) rate of Computer A is
MIPS_A and MIPS rate of Computer B is MIPS_B, then an engineer concludes that MIPS_B = MIPS_A x 2.
Under what conditions is this calculation correct? [5]

5. Student Y claims that an ARM processor with a 2 GHz clock exhibits higher performance than an
x86 Pentium processor that runs with a 1.5 GHz clock. Explain why Student Y's claim is not always
true, and give a counterexample in your answer. [5]

6. Consider three different processors P1, P2, and P3 executing the same instruction set. P1 has a 2.4 GHz
clock rate and a CPI of 1.2. P2 has a 3.0 GHz clock rate and a CPI of 1.4. P3 has a 4.0 GHz clock rate and
a CPI of 2.2.
(a) Which processor has the highest performance expressed in instructions per second? [6]

(b) If the processors each execute a program in 10 seconds, find the number of cycles and the number
of instructions. [6]

7. Consider the following processors P1 and P2. Note that P1 and P2 have different ISAs, thus the total
number of instructions is different even if the application is written in the same high-level
programming language.

Processor | Clock frequency | CPI | Instruction count
P1 | 3 GHz | 2.5 | 3x10^9
P2 | 2 GHz | 1.2 | 5x10^9

(1) Find the execution time of each processor. Which processor exhibits the better performance? [4]

(2) We may use MIPS (millions of instructions per second) to compare the performance of two different
processors. Calculate MIPS for processors P1 and P2. Which processor has better performance only
considering MIPS? [4]

(3) Please explain why we cannot use MIPS directly to compare the performance of processors. [4]

8. Assume that for a certain program compiler A results in a dynamic instruction count of 1.1x10^9 and
has an execution time of 1.2 sec, while compiler B results in a dynamic instruction count of 1.5x10^9 and
an execution time of 1.7 sec.
(a) Find the average CPI for each program given that the processor has a clock cycle time of 1 ns. [6]

(b) Assume that the compiled programs run on two different processors X and Y. If the execution times
on the two processors are the same, how much faster is the clock of the processor Y running compiler
B’s code versus the clock of the processor X running compiler A’s code? Assume that the processors
have the same microarchitecture deploying the same ISA. [4]

(c) A new compiler is developed that uses only 5.0x10^8 instructions and has an average CPI of 1.1. What is
the speedup of using this new compiler versus using compiler A or B on the original processor? [6]

9. Consider two different implementations of the same instruction set architecture. The instructions
can be divided into four classes according to their CPI (classes A, B, C, and D). Consider the following
two processors and an application.
P1: Clock frequency = 2.0GHz, CPIs for each instruction class = 1, 2, 3, 3
P2: Clock frequency = 3.0GHz, CPIs for each instruction class = 2, 2, 2, 2
Application:
Instruction count = 1.0x10^6,
fractions by instruction classes: class A = 20%, class B = 10%, class C = 40%, class D = 30%
a. Which processor is faster: P1 or P2? [8]

b. What is the global CPI for each implementation? [8]

c. Figure out the clock cycles required in both cases. [4]

10. An enhancement is proposed to a computer (called the baseline). The enhancement merges some
commonly occurring instructions into a single instruction. But this enhancement increases the clock
cycle time by 25% for all instructions. Assume that 60% of the instructions in the baseline computer are
merged into just 30% (that is, every two such instructions are merged into one). Furthermore,
assume that in the baseline machine the CPI (clock per instruction) of the instructions that cannot be
merged is 50% higher than the CPI of the instructions that can be merged. What is the speedup of the
enhanced computer over the baseline? [15]

11. Assume a program requires the execution of 50x10^6 FP instructions, 110x10^6 INT instructions,
80x10^6 load/store instructions, and 16x10^6 branch instructions. The CPI for each type of instruction
(FP, INT, load/store, and branch types) is 1, 1, 4, and 2, respectively. Assume that the processor has a
2GHz clock rate.
a. By how much must we improve the CPI of FP instructions if we want the program to run two times
faster? [8]

b. By how much is the execution time of the program improved if the CPI of INT and FP instructions is
reduced by 40% and the CPI of load/store and branch instructions is reduced by 30%? [8]

COSE222: Computer Architecture
Assignment #2
Due: Oct 29, 2020 (Wednesday) 11:59pm on Blackboard
Total score: 191

Please answer the following questions. Write your student ID and name at the top of the document. Submit
your homework in “PDF” format only. (You can easily generate PDF files from Microsoft Word or
HWP. You can also handwrite your answers and scan the handwritten documents to PDF. You
may use document capture applications such as “Office Lens” for scanning documents with
your smartphone.)
The answer rules:
(1) You may write your answers in either Korean or English.
(2) Round your final numeric answers to two decimal places.
(3) The performance of A is improved by NN % compared to the performance of B if PerfA / PerfB = 1.NN.
Please refer to the following RISC-V assembly instructions for your answers.

1. For the following C statement, write the corresponding RISC-V assembly code. Assume that the
variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that
the base addresses of arrays A and B are in registers x10 and x11, respectively. The size of a single
element of arrays A and B is 8 bytes. (Use only register x30 for storing temporary values. You
should minimize the lines of your code. Do not use mul.) [10]
B[10] = A[i+2*j];

2. For the RISC-V assembly instructions below, what is the corresponding C statement? Assume that the
variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that
the base addresses of arrays A and B are in registers x10 and x11, respectively. The size of a single
element of arrays A and B is 8 bytes. [10]
slli x30, x5, 3 // x30 = f*8
add x30, x10, x30 // x30 = &A[f]
slli x31, x6, 3 // x31 = g*8
add x31, x11, x31 // x31 = &B[g]
ld x5, 0(x30)
addi x12, x30, 24
ld x30, 0(x12)
add x30, x30, x5
sd x30, -16(x31)

3. Translate the following RISC-V code to a one-line C statement. Assume that the variables f, g, h, i, and j
are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that the base addresses of
arrays A and B are in registers x10 and x11, respectively. The size of a single element of arrays A and B
is 8 bytes. [10]
addi x30, x10, 24
addi x31, x11, 8
ld x5, 0(x30)
ld x30, 8(x31)
add x30, x30, x5
addi x5, x30, 16
slli x5, x5, 3

4. Assume the following register contents initially:


x5 = 0x0000000000000000, x6 = 0x000FF000001FFF0F

(a) Figure out the values in the registers x5, x6, and x7 respectively when the following code completes
its execution. Represent your answer in a hexadecimal format. [15]

LOOP: beq x6, x0, DONE


srli x7, x6, 1
and x6, x6, x7
addi x5, x5, 1
jal x0, LOOP
DONE:

(b) Figure out how many times the beq instruction is executed. [5]

5. Suppose the program counter (PC) is set to 0x30000000.

(a) What range of addresses can be reached using the RISC-V jump-and-link (jal) instruction? (In other
words, what is the set of possible values for the PC after jal executes?) [4]

(b) What range of addresses can be reached using the RISC-V branch-if-equal (beq) instruction? (In other
words, what is the set of possible values for the PC after beq executes?) [4]

6. Let us assume you are to translate the following C code to RISC-V assembly code. Assume that the
values of a, b, i, and j are in registers x5, x6, x7, and x29, respectively. Also, assume that register x10
holds the base address of the array D. You can use x30 and x31 as temporary registers. The size of a
single element of array D is 8 bytes.
for (i=0; i < a; i++)
for (j=0; j < b; j+=2)
D[3*j] = i - 7;

Complete the following RISC-V assembly code. [20]


addi x7, x0, 0
LOOPI: bge
addi
addi x29, x0, 0
LOOPJ: bge
addi
sd
addi
addi
jal
ENDJ: addi
jal
ENDI:

7. Consider the following RISC-V loop:


addi x5, x0, 0
addi x6, x0, 100
LOOP: blt x6, x0, DONE
addi x6, x6, -1
addi x5, x5, 4
jal x0, LOOP
DONE:

(a) What is the final value in register x5? Represent your answer in a hexadecimal format. [5]

(b) For the loop above, write the equivalent C code. Assume that the registers x5 and x6 are integers i
and j, respectively. [10]

8. Translate the following function f into RISC-V assembly language. Assume the function declaration of
g is int g(int a, int b) and the label for function g is FUNC_G in the code. Assume that the
function g receives two arguments a and b via x10 and x11, respectively, and returns its result in
x10. The function f receives four arguments a, b, c, and d in x10, x11, x12, and x13, respectively, and
returns its result in x10. The size of all variables is 8 bytes. You should minimize the number of
instructions. [10]
int f(int a, int b, int c, int d) {
return g(g(a, b), c+d);
}

addi sp, sp, -16


sd x1, 0(sp)
add x5, x12, x13

9. Write the RISC-V assembly code that creates the 64-bit constant 0x0123456789ABCDEF and stores
that value to register x10. (hint: you should use lui twice.) [8]

10. Let us assume that a given processor does not provide dedicated multiplier hardware. In this case,
we need to convert a multiplication into an equivalent function using other instructions. Assume you
are asked to design a multiplier function for two 32-bit unsigned integers. In the following
C code, A and B are 32-bit unsigned integer operands of the multiplier function. Translate the
following C code to RISC-V assembly code. The multiplication result is held in res. Assume that A, B,
i, and res are held in registers x5, x6, x7, and x28, respectively. You can use x29 and x30 as temporary
registers. [15]

int i;
int res = 0;
for (i = 0; i < 32; i++) {
if (B & 0x01)
res += (A << i);
B = B >> 1;
}
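The loop above is the classic shift-and-add multiplier. The same algorithm can be checked with a short Python sketch (illustrative only; the function name mul32 is mine, and results are truncated to 32 bits as in the C code):

```python
def mul32(a, b):
    # Multiply two 32-bit unsigned integers by shift-and-add,
    # mirroring the C loop: test each bit of B, add a shifted A.
    res = 0
    for i in range(32):
        if b & 0x01:                          # low bit of B set?
            res = (res + (a << i)) & 0xFFFFFFFF
        b >>= 1                               # move to the next bit of B
    return res

print(mul32(1234, 5678) == 1234 * 5678)  # -> True
```

Because addition modulo 2^32 distributes over the partial sums, the truncated result matches the full product taken modulo 2^32.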

11. Assume for a given processor the CPI of arithmetic instructions is 1, the CPI of load/store
instructions is 10, and the CPI of branch instructions is 3. Assume a program has following instructions
breakdown: 500 million arithmetic instructions, 300 million load/store instructions, 200 million branch
instructions.
(a) Suppose that new, more powerful arithmetic instructions are added to the instruction set. On
average, through the use of these more powerful arithmetic instructions, we can reduce the number of
arithmetic instructions needed to execute a program by 25%, while increasing the clock cycle time by
only 10%. Is this a good design choice? Why? [10]

(b) Suppose that we find a way to double the performance of arithmetic instructions. What is the overall
speedup of our machine? What if we find a way to improve the performance of arithmetic instructions
by 10 times? [10]

12. Answer the following questions.
(a) Translate the following C code to RISC-V assembly code straightforwardly. Assume that the
variables f and i are assigned to registers x5 and x6, respectively. Assume that the base addresses of
arrays A and B are in registers x10 and x11, respectively. The size of a single element of arrays A and B
is 8 bytes. (Use x28, x29, x30, and x31 as registers for temporary data. You should minimize the lines of
your code.) [15]
for (i = 200; i >= 0; i--) {
f = A[i];
A[i] = B[i];
B[i] = f;
}

(b) The C code shown in Question 12-(a) can be rewritten using pointers pA and pB as follows.
Translate the following C code to RISC-V assembly code. Assume that the variables f and i are assigned
to registers x5 and x6, respectively. Assume that the base addresses of arrays A and B are in registers
x10 and x11, respectively. The size of a single element of arrays A and B is 8 bytes. (Use x28, x29, x30,
and x31 as registers for temporary data. You should minimize the lines of your code.) [15]
pB = &B[200];
for (pA = &A[200]; pA >= &A[0]; pA = pA - 1) {
f = *pA;
*pA = *pB;
*pB = f;
pB = pB - 1;
}
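As a sanity check on the equivalence of parts (a) and (b), here is a Python sketch of both loops (not part of the required RISC-V answer; the function names are mine):

```python
def swap_indexed(A, B):
    # Part (a): swap A[i] and B[i] for i = 200 down to 0.
    for i in range(200, -1, -1):
        A[i], B[i] = B[i], A[i]

def swap_pointer(A, B):
    # Part (b): the same swap, walking two "pointers" down the arrays.
    pa = pb = 200
    while pa >= 0:
        A[pa], B[pb] = B[pb], A[pa]
        pa -= 1
        pb -= 1
```

Running both on copies of the same arrays leaves identical results, which is why part (c) reduces to comparing the dynamic instruction counts of the two translations.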

(c) Assuming the CPI of every instruction is the same, compare the performance of the above two codes.
[5]

13. Convert the decimal number 63.25 into binary representation using the following standards.
(a) IEEE 754 single precision format [5]
Sign =
Exponent =
Fraction =
(b) IEEE 754 double precision format [5]
Sign =
Exponent =
Fraction =
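One way to check your answers (a side note, not part of the assignment) is to let Python's struct module produce the raw IEEE 754 encodings:

```python
import struct

# Big-endian single ('>f') and double ('>d') precision encodings of 63.25.
single = struct.pack('>f', 63.25).hex()
double = struct.pack('>d', 63.25).hex()
print(single)  # -> 427d0000
print(double)  # -> 404fa00000000000
```

The leading hex digits encode the sign bit and biased exponent; the remaining bits are the fraction field.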

Computer Architecture
3. Instructions: Language of the Machine

Young Geun Kim


([email protected])
Intelligent Computer Architecture & Systems Lab.

1
RISC-V: Machine-level Representation
Representing Instructions
Instructions are encoded in binary
– Called machine code

RISC-V instructions
– Encoded as 32-bit instruction words
– Small number of formats encoding operation code (opcode), register numbers, etc.
– Regularity!

3
RISC-V Instruction Formats

4
RISC-V R-Type Instructions

5
R-Type Example

6
RISC-V I-Type Instructions

Immediate arithmetic or load instructions

Design Principle 3: Good design demands good compromises


– Different formats complicate decoding
– Keep formats as similar as possible
– Regularity!

7
RISC-V I-Type Instructions (cont'd)

8
I-Type Example

9
RISC-V S-Type Instructions

10
S-Type Example

11
RISC-V SB-Type Instructions

12
SB-Type Example

13
SB-Type Example

14
RISC-V U-Type Instructions

15
U-Type Example

16
RISC-V UJ-Type Instructions

17
U-Type Example

18
RISC vs. CISC
RISC (Reduced Instruction Set Computer)
Philosophy: Fewer, simpler instructions
– Might take more instructions to get a given task done
– Can execute them with small and fast hardware
– Examples: ARM, RISC-V, MIPS, IBM PowerPC

Register-oriented instruction set


– Many more (typically 32+) registers
– Use for arguments, return address, temporaries

Only load & store instructions can access memory


Each instruction has fixed size
No condition codes
– Test instructions return 0/1 in register
20
CISC (Complex Instruction Set Computer)
Add instructions to perform typical programming tasks
– Examples: IA-32 (Intel), IBM System/360

Stack-oriented instruction set


– Use stack to pass arguments, save PC, etc.
– Explicit push and pop instructions

Arithmetic instructions can access memory


– Requires memory read and write during computation
– Complex addressing modes

Instructions can have varying lengths


Condition codes
– Set as side effect of arithmetic or logical instructions 21
RISC vs. CISC
Original debate
– RISC proponents: better for optimizing compilers; can make programs run fast
with simple chip designs
– CISC proponents: easier for compilers; fewer code bytes

Current status
– For desktop/server processors, choice of ISA not a technical issue
– With enough hardware, can make anything run fast
– Code compatibility more important
– x86-64 adopted many RISC features
– More registers, use them for argument passing
– Hardware translates instructions to simpler micro-operations
– For embedded processors, RISC makes sense: smaller, cheaper, less power
22
Designing an ISA
Important metrics
– Design cost to hardware and software
– Performance and other execution measurements
– Instructions and data
– Static measurements (code size)

Influences on ISA effectiveness


– Program usage
– Organization techniques
– Pipelining, memory hierarchies
– Compiler technology
– OS requirements
– Implementation technology (e.g., memory vs. logic, frequency vs. parallelism)
23
Four Design Principles
Simplicity favors regularity
Smaller is faster
Good design demands good compromises

Make the common case fast

24
Computer Architecture
3. Performance

Young Geun Kim


([email protected])
Intelligent Computer Architecture & Systems Lab.

1
How to Define Performance?
§ There are many ways to define something as “the best”
§ Airplane Example
Airplane | Passenger Capacity | Cruising Range (miles) | Cruising Speed (m.p.h.) | Throughput (passengers * m.p.h.)
Boeing 777 | 375 | 5,256 | 610 | 228,750
Boeing 747 | 416 | 7,156 | 610 | 286,700
Airbus 380 | 525 | 8,200 | 560 | 294,000
BAC/Sud Concorde | 132 | 4,000 | 1,350 | 178,200
Douglas DC-8-50 | 146 | 8,720 | 544 | 79,424

§ Which is “the best”?

2
Applying to Computers
§ How do you decide which computer is the best?
– Processor Speed
– Memory Size
– Storage Speed
– Graphic Card / Subsystem
– Energy Efficiency
– Price
– Weight

§ How do you decide which to buy?


– Trade-off between cost and performance
– Recently, energy has been added (esp. for battery-powered systems).

3
Example of Metrics in Computers
Each level of the system stack has its own performance metric:
– Application program: answers per month
– Compiler: operations per second
– Operating System / ISA: (millions of) instructions per second (MIPS),
  (giga) floating-point operations per second (GFLOPS)
– Data path / control, functional units: cycles per instruction (CPI)
– Logic / transistors: clock frequency (GHz)
§ Most metrics are related to time (how fast)


– Time is the most classical metric for computers
4
Measuring Time
§ When measuring CPU performance, we are primarily concerned with
execution time.

Performance = 1 / Execution time

§ To compare, we say “X is n times faster than Y”

n = PerformanceX / PerformanceY = Execution timeY / Execution timeX

§ Increased performance = Reduced execution time


– Improved performance = Improved execution time

5
Example of Performance Comparison
§ X and Y do their homework
– X takes 5 hours
– Y takes 10 hours

§ Compare the performance


PerformanceX = 1 / Execution TimeX = 1 / 5 hours = 0.2

PerformanceY = 1 / Execution TimeY = 1 / 10 hours = 0.1

n = PerformanceX / PerformanceY = 0.2 / 0.1 = 2

§ So, X is two times faster than Y


6
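The definitions on the last two slides can be sketched in a few lines of Python (a quick check, not part of the course material; the function names are mine):

```python
def performance(execution_time):
    # Performance is defined as the inverse of execution time.
    return 1 / execution_time

def speedup(time_x, time_y):
    # "X is n times faster than Y" when n = Perf_X / Perf_Y = Time_Y / Time_X.
    return performance(time_x) / performance(time_y)

# X takes 5 hours, Y takes 10 hours -> X is two times faster.
print(speedup(5, 10))  # -> 2.0
```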
Clock Cycle Time vs. Clock Rate
§ Clock Cycle Time
– Time required for a clock pulse to make transitions: 0 -> 1 -> 0
Positive edge Negative edge

Time
Clock cycle time

– In other words, the time duration between positive (negative) edges

§ Clock Rate
– Inverse of clock cycle time (= 1 / clock cycle time)
– # of times to visit positive (negative) state per second
– Unit: Hz or MHz or GHz
7
Example of Clock Cycle Time
§ A machine is running at 100 MHz
– Clock rate = 100 MHz = 100 x 10^6 cycles / sec
– The positive state is visited 10^8 times in 1 sec
– Clock cycle time = 1 / {(100 x 10^6) cycles / sec} = 10 ns

8
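The same 100 MHz arithmetic, as a small Python sketch (the helper name is mine):

```python
def clock_cycle_time_ns(clock_rate_hz):
    # Clock cycle time is the inverse of clock rate; scaled here to nanoseconds.
    return 1e9 / clock_rate_hz

# A 100 MHz machine: 1 / (100 * 10^6 cycles/sec) = 10 ns per cycle.
print(clock_cycle_time_ns(100e6))  # -> 10.0
```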
How is clock cycle time determined?
§ Closely related to logic design

(figure: two storage elements separated by combinational logic, driven by a common clock;
the clock cycle must cover the setup and hold times plus the longest propagation delay)

§ Longest propagation delay


– a.k.a. critical path delay
– Critical path: a path that takes the longest timing delays among many
combinational paths 9
Execution Time
§ We will use CPU execution time frequently as the metric of how
long a program should run.
Execution time = Clock cycles for program * Clock cycle time

§ Since clock cycle time is the inverse of clock rate:

Execution time = Clock cycles for program / Clock rate

10
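The formula above, as a sketch (the numbers are hypothetical, chosen only for illustration):

```python
def execution_time_s(clock_cycles, clock_rate_hz):
    # Execution time = clock cycles for the program / clock rate.
    return clock_cycles / clock_rate_hz

# e.g. a program taking 10 million cycles on a 500 MHz machine:
print(execution_time_s(10e6, 500e6))  # -> 0.02 (seconds)
```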
Measuring clock cycles
§ CPU clock cycles / program is not so intuitive.
– It is very difficult to estimate # of CPU clock cycles for your program.

§ CPI (Cycles Per Instruction) is therefore used frequently.


– The # of cycles per instruction varies depending on the instructions,
so CPI is an average value.
– CPIs can be compared between ISAs

11
Von Neumann Architecture
§ By John von Neumann, 1945

Address / Data lines connect the CPU and Memory

CPU (Processing Unit):
① Data Movement
② Arithmetic & Logical Ops
③ Control Transfer (e.g., if, while, for, …)

Memory (Storage Unit):
– Byte Addressable Array
– Code + Data
– Stack to Support Procedures
12
Using CPI
§ Therefore, we can rewrite
– Execution time = # of Instructions * CPI * Clock cycle time
– Execution time = (# of Instructions * CPI) / Clock rate

– Improved performance (= reduced execution time) is possible with an
increased clock rate (= reduced clock cycle time), a lower CPI,
or a reduced # of instructions.

– Designers have to balance the length of each cycle and the # of cycles
required.
13
CPI Examples
§ Machine A: 1ns clock and CPI of 2.0
§ Machine B: 2ns clock and CPI of 1.2
§ Which is faster?

14
Example Solution
§ Solve CPU time for each machine
– Execution timeA = I (= # of Instructions) * 2.0 * 1 ns = 2.0 * I ns
– Execution timeB = I (= # of Instructions) * 1.2 * 2 ns = 2.4 * I ns

§ Compare performance
PerformanceA / PerformanceB = Execution timeB / Execution timeA = (2.4 * I ns) / (2.0 * I ns) = 1.2

§ So, machine A is 1.2 times faster than machine B


– CPI is not always correct…
– Why?
15
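The comparison above can be reproduced in Python (a sketch; I is an arbitrary instruction count, since it cancels in the ratio):

```python
def exec_time_ns(instruction_count, cpi, cycle_time_ns):
    # Execution time = # of instructions x CPI x clock cycle time.
    return instruction_count * cpi * cycle_time_ns

I = 1e9                              # any value works: it cancels below
time_a = exec_time_ns(I, 2.0, 1)     # machine A: CPI 2.0, 1 ns clock
time_b = exec_time_ns(I, 1.2, 2)     # machine B: CPI 1.2, 2 ns clock
print(round(time_b / time_a, 2))     # -> 1.2, so A is 1.2x faster
```

Note that the machine with the lower CPI loses here: CPI alone ignores the clock.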
CPI Variability
§ Different types of instructions often take different numbers of
cycles on the same processor.
§ CPI is often reported for classes of instructions
– CPI = ∑(i=1..n) ( CPIi x Ci ) / Total instruction count

– CPIi : the CPI of instruction class i
– Ci : the count of instructions in class i

16
CPI from Instruction Mix
§ CPI = ∑(i=1..n) ( CPIi x Ci ) / Total instruction count

§ CPI Example
Instruction Class | Frequency | CPIi
ALU operations | 43% | 1
Loads | 21% | 2
Stores | 12% | 2
Branches | 24% | 2

CPI = 0.43 x 1 + 0.21 x 2 + 0.12 x 2 + 0.24 x 2 = 1.57

Clock cycles = 1.57 * Instruction Count


17
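The weighted-average CPI above can be computed with a short Python sketch (the function name is mine):

```python
def weighted_cpi(mix):
    # mix: (frequency, CPI) pairs; global CPI = sum of frequency_i x CPI_i.
    return sum(freq * cpi for freq, cpi in mix)

# ALU ops, loads, stores, branches from the table above:
mix = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]
print(round(weighted_cpi(mix), 2))  # -> 1.57
```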
Trade-offs
§ Instruction count, CPI, and clock rate present trade-offs
– Application program, Compiler, ISA -> instruction count / instruction mix
– Data path / control, Functional units -> CPI
– Logic / transistors -> Clock rate
18
Aspects of CPU Performance
CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

             | Instruction count | CPI | Clock rate
Program      | ✓ (not just the program's algorithms) | |
Compiler     | ✓ | |
ISA          | ✓ | ✓ |
Organization |   | ✓ | ✓
Technology   |   |   | ✓
19
Another Popular Performance Metrics
§ MIPS (million instructions per second)
– MIPS = Instruction Count / (Execution Time x 10^6)

– Problems
– It does not take into account the instruction set.

§ GFLOPS (giga floating-point operations per second)


§ TFLOPS (tera floating-point operations per second)
– Operations rather than instruction
– e.g., floating-point addition, multiplication, …

20
A Wrong Use Case of MIPS
§ Consider a 500 MHz Machine
Class | CPI
Class A | 1
Class B | 2
Class C | 3

§ Consider the two compilers


Instruction Counts (millions)
Code from | A | B | C
Compiler1 | 5 | 1 | 1
Compiler2 | 10 | 1 | 1

§ Which compiler produces faster code? Which has a higher MIPS?


21
A Wrong Use Case of MIPS: Solution (I)
§ Compute Clock Cycles
– Clock cycles = ∑(i=1..n) ( CPIi x Ci )

– Clock cycles_comp1 = (1 x 5M) + (2 x 1M) + (3 x 1M) = 10M
– Clock cycles_comp2 = (1 x 10M) + (2 x 1M) + (3 x 1M) = 15M

§ Execution Time
– Execution time = (Instruction count x CPI) / Clock rate
  = Clock cycles / Clock rate
– Execution time_comp1 = 10M / 500M = 0.02 s
– Execution time_comp2 = 15M / 500M = 0.03 s

§ Code from compiler 1 is 1.5x faster! 22


A Wrong Use Case of MIPS: Solution (II)
§ Compute MIPS
– MIPS = Instruction Count / (Execution Time x 10^6)

– MIPS_comp1 = (5M + 1M + 1M) / (0.02 x 10^6) = 350

– MIPS_comp2 = (10M + 1M + 1M) / (0.03 x 10^6) = 400

§ Code from compiler 2 is faster??


– Fails to give a right answer!

23
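Both solutions above can be replayed in Python (a sketch; the names are mine):

```python
CPI = {'A': 1, 'B': 2, 'C': 3}
CLOCK_RATE = 500e6  # 500 MHz

def time_and_mips(counts_millions):
    # counts_millions: instruction counts per class, in millions.
    cycles = sum(CPI[c] * n * 1e6 for c, n in counts_millions.items())
    time_s = cycles / CLOCK_RATE                  # execution time in seconds
    instrs = sum(counts_millions.values()) * 1e6
    mips = instrs / (time_s * 1e6)                # MIPS definition from slide 20
    return time_s, mips

t1, mips1 = time_and_mips({'A': 5, 'B': 1, 'C': 1})    # compiler 1
t2, mips2 = time_and_mips({'A': 10, 'B': 1, 'C': 1})   # compiler 2
print(t1, t2)        # compiler 1's code runs faster (0.02 s vs 0.03 s) ...
print(mips1, mips2)  # ... yet compiler 2 scores the higher MIPS (~400 vs ~350)
```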
Benchmarks
§ Users often want a performance metric.
§ A benchmark is a distillation of the attributes of a workload.
– Real applications usually work best, but using them is not always feasible.

§ Desirable attributes
– Relevant: meaningful within the target domain
– Understandable
– Good metric(s)
– Scalable
– Coverage: does not oversimplify important factors in the target domain
– Acceptance: vendors and users embrace it

24
Benchmarks (cont’d)
§ De-facto industry standard benchmarks for CPU
– SPEC

§ SPEC
– Standard Performance Evaluation Cooperative
– Founded in 1988 by EE times, SUN, HP, MIPS, Apollo, DEC
– Several different SPEC benchmarks
– Most include a suite of several different applications (such as integer and
floating point components, often reported separately)
– For more information, visit http://www.spec.org

25
Amdahl’s Law
Execution speedup is proportional to the size of the
improvement and the amount affected
§ Execution time after improvement
  = (Execution time affected by improvement / Amount of improvement)
    + Execution time unaffected

§ Or
ExTime_new = ExTime_old x ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

§ Hardware-independent metrics do not work (e.g., code size).


26
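Amdahl's Law is easy to encode directly (a minimal sketch of the formula above):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup = 1 / ((1 - f) + f / s), per the formula above.
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 10% of execution made 2x faster yields only ~5% overall:
print(round(amdahl_speedup(0.10, 2), 3))  # -> 1.053
```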
Example – Amdahl’s Law
§ Floating point instructions improved to run 2x; but only 10% of
actual instructions are FP
ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = ExTime_old / ExTime_new
                = ExTime_old / (0.95 x ExTime_old) = 1.053
27
Several Announcements
Surveys
– Monday vs. Wednesday for Exams

Q&A Zoom sessions will start next week!

1
Computer Architecture
1. Introduction

Young Geun Kim


([email protected])
Intelligent Computer Architecture & Systems Lab.

2
A Variety of Computer Systems

3
Classes of Computers
Personal Computers
– General purpose, running a variety of software
– Subject to cost/performance + energy tradeoff

Super Computers
– High-end scientific and engineering calculations
– Highest capability, but accounting for a small fraction of the overall computer market

Server Computers
– Network based
– Range from small servers to large data centers
– High performance, capacity, reliability

Embedded Computers
– Hidden as components of systems
– Stringent power/performance/cost constraints

4
Opening the Box (iMac Pro)

5
Components of a Computer
CPU: Control + Datapath (focus of this course)
Memory (focus of this course)
I/Os
– GPUs
– User-Interface Devices: Display, Keyboard, Mouse, Sound
– Storage Devices: HDD, SSD
– Network Adapters: Ethernet, 3G/4G/5G, Wi-Fi, Bluetooth

Same Components for All Kinds of Computers!

6
Von Neumann Architecture
By John von Neumann, 1945

Address / Data lines connect the CPU and Memory

CPU (Processing Unit):
① Data Movement
② Arithmetic & Logical Ops
③ Control Transfer (e.g., if, while, for, …)

Memory (Storage Unit):
– Byte Addressable Array
– Code + Data
– Stack to Support Procedures
7
What is a Microprocessor?
Definition
– A processing unit built as a single tiny semiconductor chip

– Contains the arithmetic and control logic circuitry to perform the operations of
computer programs

– Intel x86, ARM, MIPS, RISC-V

Processor? Microprocessor? CPU?


– GPU? NPU?

8
Intel x86 History
4-bit -> 8-bit -> 16-bit -> 32-bit (i386) -> 32-bit (i586) -> 32-bit (i686) -> 64-bit (x86_64)

2009: 1st Gen. Core i7 (Nehalem)
2011: 2nd Gen. Core i7 (Sandy Bridge)
2012: 3rd Gen. Core i7 (Ivy Bridge)
2013: 4th Gen. Core i7 (Haswell)
9
Computer Ads.

10
Moore's Law
The number of transistors
incorporated in a chip will
approximately double every 24
months
(Gordon Moore, Intel Co-founder, 1965)

Makes novel applications feasible


– WWW, search engines
– Smartphones, VR/AR
– AI, Self-driving Cars

Computers are getting more and


more pervasive.
11
High Power Density

Source: Intel Corp.


12
Uniprocessor Performance

13
Architecture: Exploiting Parallelism
Instruction Level Parallelism (ILP)
– Pipelining
– Superscalar
– Out-of-order Execution
– Branch Prediction
– VLIW (Very Long Instruction Word)

Data Level Parallelism (DLP)


– SIMD / Vector Instructions

Task Level Parallelism (TLP)


– Simultaneous Multithreading (Hyperthreading)
– Multicore

14
Shift to Multicore
ILP Wall
– Control Dependency
– Data Dependency

Memory Wall
– Memory Latency (improved by only 10% / year)
– Cache Shows Diminishing Returns

Power Wall

15
Intel's Core Duo
2 Cores on one chip
Two levels of caches (L1, L2) on chip
291 million transistors in 143 mm² with 65nm technology

[Figure: Core0 and Core1, each with IL1 and DL1 caches, sharing an L2 Cache]

Source: http://www.sandpile.org

16
Intel's Core i7
4 Cores on one chip
Three levels of caches (L1, L2, L3) on chip
731 million transistors in 263 mm² with 45nm technology

17
Intel's Core i7 (3rd Gen.)

3rd Generation Core i7
– L1: 64 KB
– L2: 256 KB
– L3: 8 MB

1.4 billion transistors in 160 mm² with 22nm technology

18
How to Run Your Program?

19
Instruction Set Architecture (ISA)
The Hardware/Software Interface
– Hardware abstraction visible to software (OS, compilers)
– Instructions and their encodings, registers, data types, addressing modes, etc.
– Written documents about how the CPU behaves

20
A Case for Robot Cleaner
Commands (or Instructions)
– Go one step forward 00
– Go one step backward 01
– Turn left 10
– Turn right 11

A Simple Program

21
Von Neumann Architecture
By John von Neumann, 1945

[Figure: CPU (Processing Unit) connected to Memory (Storage Unit) by address, data, and instruction lines]

CPU:
① Data Movement
② Arithmetic & Logical Ops
③ Control Transfer (e.g., if, while, for)

Memory:
Byte-Addressable Array
Code + Data
Stack to Support Procedures
22
Full Levels of Abstraction

23
Abstraction is Good, But…
Abstraction helps us deal with complexity
– Hide lower-level details

These abstractions have limits


– Especially when you want to have better programs
– Need to understand details of underlying implementations

What is the right place to solve the problem?

This is the reason why you should take this course


seriously even if you don't want to be a computer
architect!

24
Topic 1: How to Design Interface?
Choices critically affect both the software programmer and hardware
designer
Example: Copying n bytes from address A to B

Trade-offs: code sizes, compiler complexity, operating frequency, number


of cycles to execute, hardware complexity, energy consumption, etc.
25
Topic 2: How to Implement?
Microarchitectures: Where should you spend transistors to run your
program faster with conforming to the given interface?

26
Topic 3: What About the Memory?
It's too slow!

27
Expectations
Hopefully, you will have a lot of fun throughout this class!
– It would take substantial time to finish your project though..

After successfully completing this course, you will have


confidence in how computer systems (including processors)
work for various fancy applications!

28
Computer Architecture
4. The Processor: Datapath and Control

Young Geun Kim


([email protected])
Intelligent Computer Architecture & Systems Lab.

1
Computer Organization

Computer
Processor Memory Devices

Control Input

Datapath Output

2
Big Picture: Processor Implementation
§ Key ideas
– Concept of datapath and control
– Where the instruction and data bits go

§ Approach
– Start with a simple implementation and iteratively improve it

§ We will examine three RISC-V implementations


– A simplified single-cycle version
– An advanced multi-cycle version
– A more realistic pipelined version

3
Subset of Instructions
§ To simplify our study of processor design, we will focus on a simple
subset of RV32I that still covers most aspects
– Data transfer: lw, sw
– Arithmetic/logical: add, sub, and, or
– Control transfer: beq

4
RISC-V Format Review

5
Execution Cycle
§ The lifecycle of an instruction
§ Instruction Fetch: Obtain instruction from instruction memory
§ Instruction Decode: Determine required actions
§ Operand Decode: Locate and obtain operand data
§ Execute: Compute result value or status
§ Result Store: Deposit results in storage for later use
§ Instruction Next: Determine the following instruction
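The stages above are exactly the loop an interpreter-style simulator runs. A minimal sketch (the instruction tuples, register names, and halt convention are made up for illustration):

```python
# Minimal fetch/decode/execute loop mirroring the lifecycle stages.
memory = {0: ("add", "x1", "x2", "x3"), 4: ("halt",)}
regs = {"x1": 0, "x2": 5, "x3": 7}
pc = 0

while True:
    inst = memory[pc]                 # Instruction Fetch
    op, *operands = inst              # Instruction Decode
    if op == "halt":
        break
    rd, rs1, rs2 = operands           # Operand Decode
    result = regs[rs1] + regs[rs2]    # Execute
    regs[rd] = result                 # Result Store
    pc += 4                           # Instruction Next

print(regs["x1"])   # 12
```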
6
Implementation Overview
§ Data flows through memory and functional units

[Figure: high-level datapath: PC and instruction memory feed the register file and ALU, with a data memory for loads and stores]

7
Digital Systems
§ Three components required to implement a digital system
– Combinational elements
– Output is dependent only on current inputs
– E.g., ALU
– Sequential elements
– Element contains state information (i.e., memory element)
– E.g., registers
– Clock signals regulate the updating of the memory elements

8
Combinational Elements
§ Acyclic network of logic gates
– Continuously responds to changes on primary inputs
– Primary outputs become (after some delay) functions of primary inputs

9
Combinational Elements: Examples
§ AND-gate: Y = A & B
§ Adder: Y = A + B
§ Multiplexer: Y = S ? A : B
§ Arithmetic/Logic Unit: Y = F(A, B)
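Since combinational elements compute outputs purely from their current inputs, each example can be modeled as a pure function. A sketch (function names and the ALU operation set are mine):

```python
# Combinational elements as pure functions: output depends only on inputs.
def and_gate(a, b):          # Y = A & B
    return a & b

def adder(a, b):             # Y = A + B
    return a + b

def mux(s, a, b):            # Y = S ? A : B
    return a if s else b

def alu(f, a, b):            # Y = F(A, B); F selects the operation
    ops = {"add": a + b, "sub": a - b, "and": a & b, "or": a | b}
    return ops[f]

print(mux(1, 3, 9), alu("sub", 7, 2))   # 3 5
```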

10
Storage Elements: Register
§ Register
– Based on the D Flip Flops
– N-bit input and output
– Write enable input
– Write Enable:
– 0: Data Out will not change
– 1: Data Out will become Data In

– Stored data changes only on rising (or falling) clock edge

11
Storage Elements: Register File
§ Register File consists of 32 registers
– Two 32-bit output busses:
– Read data1 and Read data2
– One 32-bit input bus: Write data
– x0 hard-wired to value 0
§ Register is selected by:
– Read register1 selects the register to put on Read data1
– Read register2 selects the register to put on Read data2
– Write register selects the register to be written with Write data when
RegWrite = 1

§ Clock input (CLK)


– The CLK input is a factor only for write operation

12
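The register file's behavior described above can be sketched as a small model (class and method names are mine; the clock edge is modeled simply as the moment write() is called):

```python
# Sketch of the 32-entry register file: two read ports, one write port
# gated by RegWrite, and x0 hard-wired to zero.
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        # Reads behave combinationally: no clock involved.
        return self.regs[rs1], self.regs[rs2]

    def write(self, rd, data, reg_write):
        # Write happens only on the clock edge when RegWrite = 1;
        # writes to x0 are ignored so it always reads as 0.
        if reg_write and rd != 0:
            self.regs[rd] = data

rf = RegisterFile()
rf.write(5, 42, reg_write=1)
rf.write(0, 99, reg_write=1)        # x0 stays 0
print(rf.read(5, 0))                # (42, 0)
```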
Storage Elements: Memory
§ Memory has two busses
– One output bus: Read data (Data Out)
– One input bus: Write data (Data In)

§ Address
– Selects the word to put on Data Out when MemRead = 1
– The word to be written via the Data In when MemWrite = 1

§ Clock input (CLK)


– The CLK input is a factor only for write operation
– During read, behaves as combinational logic block
– Valid address -> Data Out valid after "access time"

13
Instruction Fetch (IF) RTL
§ Common RTL operations
– Fetch instruction
– Mem[PC]; // fetch instruction from instruction memory
– Update program counter
– PC <- PC + 4; // calculate next address

14
Datapath: IF Unit

15
Add RTL
§ Add instruction
– add rd, rs1, rs2
– Mem[PC]; // fetch instruction from instruction memory
– R[rd] <- R[rs1] + R[rs2] // ADD instruction
– PC <- PC + 4; // calculate next address

add encoding fields: funct7 = 0000000, funct3 = 000, opcode = 0110011

16
Datapath: Reg/Reg Operations
§ R[rd] <- R[rs1] op R[rs2];
– ALU control and RegWrite based on decoded instruction
– Read register1, read register2, and write register from rs1, rs2, rd fields

17
OR Immediate RTL
§ OR immediate instruction
– ori rd, rs1, imm12
– Mem[PC]; // fetch instruction from instruction memory
– R[rd] <- R[rs1] OR SignExt[imm12] // OR operation with Sign-Extension
– PC <- PC + 4; // calculate next address

ori encoding fields: funct3 = 110, opcode = 0010011

18
Datapath: Immediate Operations
§ 1 Mux and Immediate Generation Unit are added
– Mux: controlled by ALUSrc, selects register data (R-format) or the immediate

[Figure: ALUSrc mux selecting between Read data 2 (0) and the Imm Gen output (1)]

19
Load RTL
§ Load instruction
– lw rd, imm12(rs1)
– Mem[PC]; // fetch instruction from instruction memory
– Addr <- R[rs1] + SignExt(imm12); // Compute memory address
– R[rd] <- Mem[Addr]; // Load data into register
– PC <- PC + 4; // calculate next address

lw encoding fields: funct3 = 010, opcode = 0000011
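The SignExt step used in the lw address computation above can be sketched as follows (the helper name is mine):

```python
# Sign-extend a 12-bit immediate, as used by lw address computation:
# Addr = R[rs1] + SignExt(imm12).
def sign_extend(imm12):
    imm12 &= 0xFFF                      # keep the low 12 bits
    if imm12 & 0x800:                   # sign bit set -> negative value
        imm12 -= 0x1000
    return imm12

print(sign_extend(0x004))   # 4
print(sign_extend(0xFFC))   # -4
```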

20
Datapath: Load
§ Sign extension logic is added
– Offset can be either positive or negative

[Figure: datapath detail with the Imm Gen sign-extension feeding the ALUSrc mux]

21
Store RTL
§ Store instruction
– sw rs2, imm12(rs1)
– Mem[PC]; // fetch instruction from instruction memory
– Addr <- R[rs1] + SignExt(imm12); // Compute memory address
– Mem[Addr] <- R[rs2]; // Store data into memory
– PC <- PC + 4; // calculate next address

sw encoding fields: funct3 = 010, opcode = 0100011

22
Datapath: R-Type/Load/Store
§ A path from register to memory has been created
§ 1 Mux is added to select data to be stored in registers

23
Branch RTL
§ Branch instruction
– beq rs1, rs2, imm12
– Mem[PC]; // fetch instruction from instruction memory
– Zero <- ((R[rs1] - R[rs2]) == 0); // Use ALU, subtract and check Zero output
– If (Zero == 1) then
– PC <- PC + (SignExt(imm12) << 1); // Branch if equal
– else
– PC <- PC + 4; // Keep going otherwise

beq encoding fields: funct3 = 000, opcode = 1100011

24
Datapath: Branch

25
Putting It All Together

26
Adding Control
§ Design Steps
– Identify control points for pieces of the datapath

– Categorize type of control signals


– Flow of data through multiplexors
– Writes of state information

– Derive control signals for each instruction

– Put it all together!

27
Single-Cycle Datapath
[Figure: single-cycle datapath. PC and instruction memory feed the register file and Imm Gen; the ALUSrc mux selects the second ALU input; the MemtoReg mux routes the data memory output or the ALU result back to the register file; the PCSrc mux selects PC + 4 or the branch target. Control signals: PCSrc, RegWrite, ALUSrc, MemWrite, MemRead, MemtoReg, ALUOp.]

§ This datapath supports the following instructions


– add, sub, and, or, lw, sw, beq
28
Single-Cycle Control
Signal     Description
RegWrite   Specify if the destination register needs to be written
ALUSrc     Select whether source of ALU is register or immediate
ALUOp      Specify operation of ALU
MemWrite   Specify whether memory needs to be written
MemRead    Specify whether memory needs to be read
MemtoReg   Select whether memory or ALU output is used for "Write data" of register
PCSrc      Select whether PC + 4 or branch target address is used for the next PC
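The per-instruction settings listed in the tables on the following slides can be gathered into one lookup table. A sketch (values follow those tables; "X" marks don't-cares; for beq, PCSrc is shown as 1, though in hardware it is also gated by the ALU's Zero output):

```python
# Single-cycle control signals per instruction class.
CONTROL = {
    "R-type": dict(RegWrite=1, ALUSrc=0, MemWrite=0, MemRead=0, MemtoReg=0,   PCSrc=0),
    "lw":     dict(RegWrite=1, ALUSrc=1, MemWrite=0, MemRead=1, MemtoReg=1,   PCSrc=0),
    "sw":     dict(RegWrite=0, ALUSrc=1, MemWrite=1, MemRead=0, MemtoReg="X", PCSrc=0),
    "beq":    dict(RegWrite=0, ALUSrc=0, MemWrite=0, MemRead=0, MemtoReg="X", PCSrc=1),
}

print(CONTROL["lw"]["MemRead"])   # 1
```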

29
R-Type Instruction Dataflow
§ For add, sub, and, or instructions
[Figure: single-cycle datapath with the active path for R-type (add, sub, and, or) instructions highlighted]

30
R-Type Instruction Control

Signal Value Description


RegWrite 1 To enable writing rd
ALUSrc 0 To select the Read Data 2 from register file
ALUOp OP To select an appropriate operation for ALU
MemWrite 0 To disable writing memory
MemRead 0 To disable reading memory
MemtoReg 0 To select ALU output for “Write Data” of register
PCSrc 0 To select PC + 4

31
I-Type Load Instruction Dataflow
§ For lw instruction
[Figure: single-cycle datapath with the active path for lw highlighted]

32
I-Type Load Instruction Control

Signal Value Description


RegWrite 1 To enable writing rd
ALUSrc 1 To select the Immediate Offset Value from instruction
ALUOp OP To add “Read Data 1” and Immediate Offset
MemWrite 0 To disable writing memory
MemRead 1 To enable reading memory
MemtoReg 1 To select memory output for “Write Data” of register
PCSrc 0 To select PC + 4

33
S-Type Store Instruction Dataflow
§ For sw instruction
[Figure: single-cycle datapath with the active path for sw highlighted]

34
S-Type Store Instruction Control

Signal Value Description


RegWrite 0 To disable writing rd
ALUSrc 1 To select the Immediate Offset Value from instruction
ALUOp OP To add “Read Data 1” and Immediate Offset
MemWrite 1 To enable writing memory
MemRead 0 To disable reading memory
MemtoReg X Not used
PCSrc 0 To select PC + 4

35
SB-Type Branch Instruction Dataflow
§ For beq instruction
[Figure: single-cycle datapath with the active path for beq highlighted]

36
SB-Type Branch Instruction Control

Signal Value Description


RegWrite 0 To disable writing rd
ALUSrc 0 To select the Read Data 2 from register file
ALUOp OP To sub “Read Data 1” and “Read Data 2”
MemWrite 0 To disable writing memory
MemRead 0 To disable reading memory
MemtoReg X Not used
PCSrc 1 To select target address

37
ALU Control
§ ALU used for
– R-type: F depends on opcode
– Load/Store: F = add
– Branch: F = subtract

38
ALU Control (Cont’d)
§ Assume 2-bit ALUOp derived from opcode
– Combinational logic derives ALU control
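Assuming the usual 2-bit ALUOp convention (00 for load/store adds, 01 for branch subtracts, 10 for R-type decode from the funct fields), the combinational decode can be sketched as follows (the encodings are RV32I's; the function name is mine):

```python
# Sketch of the ALU control decode: 2-bit ALUOp plus funct fields
# select the ALU operation.
def alu_control(alu_op, funct3=None, funct7=None):
    if alu_op == 0b00:                  # load/store: address add
        return "add"
    if alu_op == 0b01:                  # branch: compare via subtract
        return "sub"
    # alu_op == 0b10: R-type, decode the funct fields
    r_type = {(0x0, 0x00): "add", (0x0, 0x20): "sub",
              (0x7, 0x00): "and", (0x6, 0x00): "or"}
    return r_type[(funct3, funct7)]

print(alu_control(0b00))                # add
print(alu_control(0b10, 0x0, 0x20))    # sub
```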

39
Main Control Unit
§ Control signals derived from instruction

40
Put It All Together

41
Single Cycle Processor
§ Advantages
– Single cycle per instruction makes logic and clock simple

§ Disadvantages
– Cycle time is determined by the worst-case path
– Critical path: load instruction
– Instruction memory -> register file -> ALU -> data memory -> register file
– Not feasible to adapt to different instructions
– Different instructions can take different lengths of time
– Inefficient utilization of memory and functional units

§ We will improve performance with different approaches


42
To Mitigate the Disadvantages
§ Multicycle Implementation
– Divide each instruction into a series of steps
– Each step will take one clock cycle
– Different instructions can have different CPI
§ Requires a few significant changes to organization
– Use registers to separate each stage
§ Example
– R-format instruction (4 cycles)
– (1) Instruction fetch (2) Instruction decode/register fetch (3) ALU
operation (4) Register write
– Load instruction (5 cycles)
– (1) Instruction fetch (2) Instruction decode/register fetch (3) Address
computation (4) Memory read (5) Register write
43
Computer Architecture
4. The Processor: Datapath and Control

Young Geun Kim


([email protected])
Intelligent Computer Architecture & Systems Lab.

1
Single Cycle Processor
§ Advantages
– Single cycle per instruction makes logic and clock simple

§ Disadvantages
– Cycle time is determined by the worst-case path
– Critical path: load instruction
– Instruction memory -> register file -> ALU -> data memory -> register file
– Not feasible to adapt to different instructions
– Different instructions can take different lengths of time
– Inefficient utilization of memory and functional units

§ We will improve performance with different approaches


2
To Mitigate the Disadvantages
§ Multicycle Implementation
– Divide each instruction into a series of steps
– Each step will take one clock cycle
– Different instructions can have different CPI
§ Requires a few significant changes to organization
– Use registers to separate each stage
§ Example
– R-format instruction (4 cycles)
– (1) Instruction fetch (2) Instruction decode/register fetch (3) ALU
operation (4) Register write
– Load instruction (5 cycles)
– (1) Instruction fetch (2) Instruction decode/register fetch (3) Address
computation (4) Memory read (5) Register write
3
Multicycle Divisions
§ Divide datapath into steps (1 cycle each)
§ Instructions take from 3 to 5 steps in our multicycle design
§ Instruction Fetch: Obtain instruction from instruction memory
§ Register Fetch: Locate and obtain operand data
§ Execute: Compute result value or status
§ Memory Access: Load and/or store values from/to memory
§ Write Back: Deposit results in storage for later use
§ Instruction Next: Determine the following instruction
4
Multicycle Divisions (Cont’d)
§ From datapath point of view

[Figure: datapath split into five stages, separated by registers]
IF (Instruction Fetch) | RF (Register Fetch) | EX (Execution) | MEM (Memory Access) | WB (Write Back)

5
Multicycle Implementation Overview
[Figure: multicycle datapath. A single memory holds instructions and data; the instruction register, A/B registers, ALUOut register, and memory data register hold values between steps; one ALU computes arithmetic results, memory addresses, and PC + 4.]

* W/O control signals

§ Single memory unit for instructions and data


– Registers used to store output during instruction execution
§ Single ALU used for arithmetic/logic, memory addresses, and next instructions
6
Specific Changes
§ A single memory unit is used for both instructions and data

§ There is a single ALU rather than an ALU and additional adder

§ One or more registers are added after every major functional


unit to hold output
– Instruction register
– Needs a write-enable line since the instruction is kept across multiple cycles
– Memory data register (MDR)
– A and B registers
– ALUOut register

7
Stage Description of Multicycle Design
Step 1: Instruction Fetch (all instructions)
    IR = Memory[PC]
    PC = PC + 4
Step 2: Instruction Decode / Register Fetch (all instructions)
    A = Reg[IR[19-15]]
    B = Reg[IR[24-20]]
    ALUOut = PC + Branch Label (i.e., sign-extend(imm12 << 1))
Step 3: Execution / Address Computation / Branch Completion
    R-type:  ALUOut = A op B
    Memory:  ALUOut = A + sign-extend(imm12)
    Branch:  If (A == B) then PC = ALUOut
Step 4: Memory Access / R-Type Completion
    R-type:  Reg[IR[11-7]] = ALUOut
    Load:    MDR = Memory[ALUOut]
    Store:   Memory[ALUOut] = B
Step 5: Load Completion
    Load:    Reg[IR[11-7]] = MDR

8
Step1: Instruction Fetch
Step table row (Instruction Fetch, all instructions):
    IR = Memory[PC]
    PC = PC + 4

9
Step1: Instruction Fetch (Cont’d)
§ RTL Description
IR = Mem[PC]
PC = PC + 4

[Figure: multicycle datapath with the path active in this step highlighted]

10
Step2: Instruction Decode and Register Fetch
Step table row (Instruction Decode / Register Fetch, all instructions):
    A = Reg[IR[19-15]]
    B = Reg[IR[24-20]]
    ALUOut = PC + Branch Label (i.e., sign-extend(imm12 << 1))

11
Step2: Instruction Decode and Register Fetch (Cont’d)
§ RTL Description
A = Reg[IR[19-15]]
B = Reg[IR[24-20]]
ALUOut = PC + Branch Label (i.e., sign-extend (imm12 << 1))

[Figure: multicycle datapath with the path active in this step highlighted]

12
Step2 Special Note
§ Control Lines
– Not dependent on instruction type
– Instruction is still being decoded at this step
– ALU used to calculate branch destination just in case we decode
a branch instruction

13
Step3: R-Type Execution
Step table row (Execution, R-type):
    ALUOut = A op B

14
Step3: R-Type Execution
§ RTL Description
ALUOut = A op B

[Figure: multicycle datapath with the path active in this step highlighted]

15
Step4: R-Type Completion Step
Step table row (R-Type Completion):
    Reg[IR[11-7]] = ALUOut

16
Step4: R-Type Completion Step
§ RTL Description
Reg[IR[11-7]] = ALUOut

[Figure: multicycle datapath with the path active in this step highlighted]

17
Step3: Branch Completion Step
Step table row (Branch Completion):
    If (A == B) then PC = ALUOut

18
Step3: Branch Completion Step (Cont’d)
§ RTL Description
If (A == B) then
    PC = ALUOut

[Figure: multicycle datapath with the path active in this step highlighted]

19
Step3: Memory Execution
Step table row (Address Computation, memory instructions):
    ALUOut = A + sign-extend(imm12)

20
Step3: Memory Execution (Cont’d)
§ RTL Description
ALUOut = A + sign-extend (imm12)

[Figure: multicycle datapath with the path active in this step highlighted]

21
Step4: Load Memory Access Step
Step table row (Memory Access, load):
    MDR = Memory[ALUOut]

22
Step4: Load Memory Access Step (Cont’d)
§ RTL Description
MDR = Memory[ALUOut]

[Figure: multicycle datapath with the path active in this step highlighted]

23
Step5: Load Completion Step
Step table row (Load Completion):
    Reg[IR[11-7]] = MDR

24
Step5: Load Completion Step (Cont’d)
§ RTL Description
Reg[IR[11-7]] = MDR

[Figure: multicycle datapath with the path active in this step highlighted]

25
Step4: Store Memory Access Step
Step table row (Memory Access, store):
    Memory[ALUOut] = B

26
Step4: Store Memory Access Step (Cont’d)
§ RTL Description
Memory[ALUOut] = B

[Figure: multicycle datapath with the path active in this step highlighted]

27
Summary of Multicycle Steps
Step 1: Instruction Fetch (all instructions)
    IR = Memory[PC]
    PC = PC + 4
Step 2: Instruction Decode / Register Fetch (all instructions)
    A = Reg[IR[19-15]]
    B = Reg[IR[24-20]]
    ALUOut = PC + Branch Label (i.e., sign-extend(imm12 << 1))
Step 3: Execution / Address Computation / Branch Completion
    R-type:  ALUOut = A op B
    Memory:  ALUOut = A + sign-extend(imm12)
    Branch:  If (A == B) then PC = ALUOut
Step 4: Memory Access / R-Type Completion
    R-type:  Reg[IR[11-7]] = ALUOut
    Load:    MDR = Memory[ALUOut]
    Store:   Memory[ALUOut] = B
Step 5: Load Completion
    Load:    Reg[IR[11-7]] = MDR

§ Load is the most complicated instruction


28
Control Signals
§ Control signals no longer determined solely from decoded
instruction
– Finite state machine used for sequencing
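A minimal sketch of such a sequencing FSM, with one state per step from the stage description table (state names are mine):

```python
# Sequencing FSM for the multicycle control: the next state depends on
# the current state and the decoded instruction class.
NEXT_STATE = {
    ("FETCH",  None):     "DECODE",
    ("DECODE", "R-type"): "EXECUTE",
    ("DECODE", "lw"):     "ADDR",
    ("DECODE", "sw"):     "ADDR",
    ("DECODE", "beq"):    "BRANCH",
    ("EXECUTE", None):    "R-WRITE",
    ("ADDR",   "lw"):     "MEM-READ",
    ("ADDR",   "sw"):     "MEM-WRITE",
    ("MEM-READ", None):   "LOAD-WRITE",
}
FINAL = {"R-WRITE", "MEM-WRITE", "LOAD-WRITE", "BRANCH"}

def steps(inst_class):
    """Walk the FSM from FETCH until an instruction-final state."""
    state, path = "FETCH", ["FETCH"]
    while state not in FINAL:
        state = NEXT_STATE.get((state, inst_class),
                               NEXT_STATE.get((state, None)))
        path.append(state)
    return path

print(len(steps("lw")))    # 5 states -> 5 cycles, matching the table
```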

29
CPI of the Multicycle Implementation
§ Number of clock cycles
– Loads: 5
– Stores: 4
– R-format instructions: 4
– Branches (& Jumps): 3

§ Instruction mix
– 22% Loads, 11% Stores, 49% R-format instructions, 16% Branches, and 2%
Jumps

§ CPI = 0.22 x 5 + 0.11 x 4 + 0.49 x 4 + 0.16 x 3 + 0.02 x 3 = 4.04


– CPI of multicycle would be higher than that of single cycle
– But, clock cycle time of multicycle is much shorter than that of single cycle

30
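The weighted-CPI computation for the multicycle design can be checked the same way (mix and cycle counts taken from the slide above):

```python
# Weighted CPI for the multicycle design, from the instruction mix above.
cycles = {"loads": 5, "stores": 4, "r_format": 4, "branches": 3, "jumps": 3}
mix    = {"loads": 0.22, "stores": 0.11, "r_format": 0.49,
          "branches": 0.16, "jumps": 0.02}

cpi = sum(mix[k] * cycles[k] for k in mix)
print(round(cpi, 2))   # 4.04
```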
Pros and Cons of Multicycle Design
§ Advantages
– Shorter cycle time
– Simple instructions executed in short period of time
– Variable cycles per instruction: no longer restricted by the worst case
– Functional units can be used more than once/instruction
– Less hardware required to implement processor

§ Disadvantages
– Requires additional registers to store between stages
– More timing paths to design, analyze, and tune

31
Summary
§ Processor design requires refinement of datapath and control

§ Disadvantages of the single cycle implementation


– Long cycle time, too long for all instructions except for the slowest
– Inefficient hardware utilization with unnecessarily duplicated resources

§ Multicycle Implementation
– Partition execution into small steps of comparable duration

32
