Computer Architecture (Syllabus, Assignment 1 and 2, Lessons 2 and 3)
Assignment #1
Due: Oct 6, 2021 (Wednesday) 11:59pm on Blackboard
Total points: 125
Please answer the following questions. Write your student ID and name on the top of the document.
Submit your homework in PDF format only. (You can easily generate PDF files from Microsoft
Word or Hangul Word Processor (HWP). You can also handwrite your answers and scan the handwritten
documents into PDF format. You can use a document capture application such as “Microsoft Lens” to
scan your documents with your smartphone.)
The answer rules:
(1) You can write answers in both Korean and English.
(2) Please make your final answer numbers have two decimal places.
(3) Performance of A is improved by NN % compared to performance B if PerfA / PerfB = 1.NN.
2. Write the eight great ideas in computer architecture research mentioned during the class. [8]
3. Describe the steps that transform a program written in high-level language such as C into a
representation that is directly executed by a computer processor. [4]
This study source was downloaded by 100000855754497 from CourseHero.com on 04-21-2023 00:58:46 GMT -05:00
https://www.coursehero.com/file/113831275/COSE222-HW1pdf/
4. Computer A can execute a C program 10 times in one second and Computer B can execute the same C
program 20 times in one second. If the MIPS (million instructions per second) rate of Computer A is
MIPS_A and MIPS rate of Computer B is MIPS_B, then an engineer concludes that MIPS_B = MIPS_A x 2.
Under what conditions is this calculation correct? [5]
5. Student Y stated that the performance of the ARM processor using 2 GHz clock exhibits higher
performance than the x86 Pentium processor that runs with 1.5 GHz clock. Explain why the statement
by Student Y is not always true. Please take a counter example in your answer. [5]
6. Consider three different processors P1, P2, and P3 executing the same instruction set. P1 has 2.4GHz
clock rate and a CPI of 1.2. P2 has a 3.0GHz clock rate and CPI of 1.4. P3 has a 4.0GHz clock rate and has
a CPI of 2.2.
(a) Which processor has the highest performance expressed in instructions per second? [6]
(b) If the processors each execute a program in 10 seconds, find the number of cycles and the number
of instructions. [6]
7. Consider the following processors P1 and P2. Note that P1 and P2 have different ISAs, thus the total
number of instructions is different even if the application is written in the same high-level
programming language.
(1) Find the execution time of each processor. Which processor exhibits the better performance? [4]
(2) We may use MIPS (millions of instructions per second) to compare the performance of two different
processors. Calculate MIPS for processors P1 and P2. Which processor has better performance only
considering MIPS? [4]
(3) Please explain why we cannot use MIPS directly to compare the performance of processors. [4]
8. Assume that for a certain program compiler A results in a dynamic instruction count of 1.1 × 10^9 and
has an execution time of 1.2 sec, while compiler B results in a dynamic instruction count of 1.5 × 10^9 and
an execution time of 1.7 sec.
(a) Find the average CPI for each program given that the processor has a clock cycle time of 1 ns. [6]
(b) Assume that the compiled programs run on two different processors X and Y. If the execution times
on the two processors are the same, how much faster is the clock of the processor Y running compiler
B’s code versus the clock of the processor X running compiler A’s code? Assume that the processors
have the same microarchitecture deploying the same ISA. [4]
(c) A new compiler is developed that uses only 5.0 × 10^8 instructions and has an average CPI of 1.1. What is
the speedup of using this new compiler versus using compiler A or B on the original processor? [6]
9. Consider two different implementations of the same instruction set architecture. The instructions
can be divided into four classes according to their CPI (classes A, B, C, and D). Consider the following
two processors and an application.
P1: Clock frequency = 2.0GHz, CPIs for each instruction class = 1, 2, 3, 3
P2: Clock frequency = 3.0GHz, CPIs for each instruction class = 2, 2, 2, 2
Application:
Instruction count = 1.0 × 10^6,
fractions by instruction classes: class A = 20%, class B = 10%, class C = 40%, class D = 30%
a. Which processor is faster: P1 or P2? [8]
b. What is the global CPI for each implementation? [8]
10. An enhancement is proposed to a computer (called the baseline). The enhancement merges some
commonly occurring instructions into a single instruction. But this enhancement increases the clock
cycle time by 25% for all instructions. Assume that 60% of the instructions in the baseline computer are
merged into just 30%. (That means 2 instructions are merged into one instruction.) Furthermore,
assume that in the baseline machine the CPI (clock per instruction) of the instructions that cannot be
merged is 50% higher than the CPI of the instructions that can be merged. What is the speedup of the
enhanced computer over the baseline? [15]
11. Assume a program requires the execution of 50 × 10^6 FP instructions, 110 × 10^6 INT instructions,
80 × 10^6 load/store instructions, and 16 × 10^6 branch instructions. The CPI for each type of instruction
(FP, INT, load/store, and branch types) is 1, 1, 4, and 2, respectively. Assume that the processor has a
2GHz clock rate.
a. By how much must we improve the CPI of FP instructions if we want the program to run two times
faster? [8]
b. By how much is the execution time of the program improved if the CPI of INT and FP instructions is
reduced by 40% and the CPI of load/store and branch instructions is reduced by 30%? [8]
COSE222: Computer Architecture
Assignment #2
Due: Oct 29, 2020 (Wednesday) 11:59pm on Blackboard
Total score: 191
Please answer the following questions. Write your student ID and name on the top of the document. Submit
your homework in PDF format only. (You can easily generate PDF files from Microsoft Word or
HWP. You can also handwrite your answers and scan the handwritten documents into PDF format. You
may use a document capture application such as Office Lens for scanning documents with
your smartphone.)
The answer rules:
(1) You can write answers in both Korean and English.
(2) Please make your final answer numbers have two decimal places.
(3) Performance of A is improved by NN % compared to performance B if PerfA / PerfB = 1.NN.
Please refer to the following RISC-V assembly instructions for your answers.
This study source was downloaded by 100000855754497 from CourseHero.com on 04-21-2023 00:58:42 GMT -05:00
https://www.coursehero.com/file/113831697/COSE222-HW2pdf/
1. For the following C statement, write the corresponding RISC-V assembly code. Assume that the
variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that
the base address of the array A and B are in registers x10 and x11, respectively. The size of a single
element of the array A and B is 8 bytes. (Use only a register x30 for storing temporary values. You
should minimize the lines of your code. Do not use mul) [10]
B[10] = A[i+2*j];
2. For the RISC-V assembly instructions below, what is the corresponding C statement? Assume that the
variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that
the base address of the array A and B are in registers x10 and x11, respectively. The size of a single
element of the array A and B is 8 bytes. [10]
slli x30, x5, 3 // x30 = f*8
add x30, x10, x30 // x30 = &A[f]
slli x31, x6, 3 // x31 = g*8
add x31, x11, x31 // x31 = &B[g]
ld x5, 0(x30)
addi x12, x30, 24
ld x30, 0(x12)
add x30, x30, x5
sd x30, -16(x31)
3. Translate the following RISC-V code to one-line C statement. Assume that the variables f, g, h, i, and j
are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that the base address of the
array A and B are in registers x10 and x11, respectively. The size of a single element of the array A and B
is 8 bytes. [10]
addi x30, x10, 24
addi x31, x11, 8
ld x5, 0(x30)
ld x30, 8(x31)
add x30, x30, x5
addi x5, x30, 16
slli x5, x5, 3
(a) Figure out the values in the registers x5, x6, and x7 respectively when the following code completes
its execution. Represent your answer in a hexadecimal format. [15]
(b) Figure out how many times the beq instruction is executed. [5]
(a) What range of addresses can be reached using the RISC-V jump-and-link (jal) instruction? (In other
words, what is the set of possible values for the PC after jal executes?) [4]
(b) What range of addresses can be reached using the RISC-V branch-if-equal (beq) instruction? (In other
words, what is the set of possible values for the PC after beq executes?) [4]
6. Let us assume you are to translate the following C code to RISC-V assembly code. Assume that the
values of a, b, i, and j are in registers x5, x6, x7, and x29, respectively. Also, assume that register x10
holds the base address of the array D. You can use x30 and x31 as temporary registers. The size of a
single element of array D is 8 bytes.
for (i=0; i < a; i++)
for (j=0; j < b; j+=2)
D[3*j] = i - 7;
(a) What is the final value in register x5? Represent your answer in a hexadecimal format. [5]
(b) For the loop above, write the equivalent C code. Assume that the registers x5 and x6 are integers i
and j, respectively. [10]
8. Translate the following function f into RISC-V assembly language. Assume the function declaration of
g is int g(int a, int b) and the label for function g is FUNC_G in the code. Let us assume that the
function g receives two arguments a and b via x10 and x11 respectively and return the result using
x10. The function f receives four arguments a, b, c, and d using x10, x11, x12, and x13 respectively and
return the result using x10. The size of all variables is 8 bytes. You should minimize the number of
instructions. [10]
int f(int a, int b, int c, int d) {
return g(g(a, b), c+d);
}
9. Write the RISC-V assembly code that creates the 64-bit constant 0x0123456789ABCDEF and stores
that value to register x10. (hint: you should use lui twice.) [8]
10. Let us assume that a given processor does not provide dedicated multiplier hardware. In this case
we need to convert a multiplication into an equivalent function using other instructions. Assume you
are requested to design a multiplier function for two 32-bit unsigned integer numbers. In the following
C code, A and B are 32-bit unsigned integer types and operands of the multiplier function. Translate the
following C code to the RISC-V assembly code. The multiplication result is held in res. Assume that A, B,
i, and res holds the registers x5, x6, x7, and x28 respectively. You can use x29 and x30 for temporary
registers. [15]
int i;
int res = 0;
for (i = 0; i < 32; i++) {
if (B & 0x01)
res += (A << i);
B = B >> 1;
}
11. Assume for a given processor the CPI of arithmetic instructions is 1, the CPI of load/store
instructions is 10, and the CPI of branch instructions is 3. Assume a program has following instructions
breakdown: 500 million arithmetic instructions, 300 million load/store instructions, 200 million branch
instructions.
(a) Suppose that new, more powerful arithmetic instructions are added to the instruction set. On
average, through the use of these more powerful arithmetic instructions, we can reduce the number of
arithmetic instructions needed to execute a program by 25%, while increasing the clock cycle time by
only 10%. Is this a good design choice? Why? [10]
(b) Suppose that we find a way to double the performance of arithmetic instructions. What is the overall
speedup of our machine? What if we find a way to improve the performance of arithmetic instructions
by 10 times? [10]
12. Answer the following questions.
(a) Translate the following C code to RISC-V assembly code straightforwardly. Assume that the
variables f and i are assigned to registers x5 and x6, respectively. Assume that the base address of the
array A and B are in registers x10 and x11, respectively. The size of a single element of the array A and B
is 8 bytes. (Use x28, x29, x30, and x31 as registers for temporary data. You should minimize the lines of
your code.) [15]
for (i = 200; i >= 0; i--) {
f = A[i];
A[i] = B[i];
B[i] = f;
}
(b) The above C code shown in Question 12-(a) can be rewritten using pointers pA and pB as follows.
Translate the following C code to RISC-V assembly code. Assume that the variables f and i are assigned
to registers x5 and x6, respectively. Assume that the base address of the array A and B are in registers
x10 and x11, respectively. The size of a single element of the array A and B is 8 bytes. (Use x28, x29, x30,
and x31 as registers for temporary data. You should minimize the lines of your code.) [15]
pB = &B[200];
for (pA = &A[200]; pA >= &A[0]; pA = pA - 1) {
f = *pA;
*pA = *pB;
*pB = f;
pB = pB - 1;
}
(c) Assuming the CPI of every instruction is the same, compare the performance of the above two codes.
[5]
13. Convert the decimal number 63.25 into binary representation using the following standard.
(a) IEEE 754 single precision format [5]
Sign =
Exponent =
Fraction =
(b) IEEE 754 double precision format [5]
Sign =
Exponent =
Fraction =
Computer Architecture
3. Instructions: Language of the Machine
1
RISC-V: Machine-level Representation
Representing Instructions
Instructions are encoded in binary
– Called machine code
RISC-V instructions
– Encoded as 32-bit instruction words
– Small number of formats encoding operation code (opcode), register
numbers
– Regularity!
3
RISC-V Instruction Formats
4
RISC-V R-Type Instructions
5
R-Type Example
6
RISC-V I-Type Instructions
7
RISC-V I-Type Instructions (Cont’d)
8
I-Type Example
9
RISC-V S-Type Instructions
10
S-Type Example
11
RISC-V SB-Type Instructions
12
SB-Type Example
13
SB-Type Example
14
RISC-V U-Type Instructions
15
U-Type Example
16
RISC-V UJ-Type Instructions
17
UJ-Type Example
18
RISC vs. CISC
RISC (Reduced Instruction Set Computer)
Philosophy: Fewer, simple instructions
– Might take more to get given task done
– Can execute them with small and fast hardware
– ARM, RISC-V, MIPS, IBM PowerPC
Current status
– For desktop/server processors, choice of ISA not a technical issue
– With enough hardware, can make anything run fast
– Code compatibility more important
– x86-64 adopted many RISC features
– More registers, use them for argument passing
– Hardware translates instructions to simpler micro-operations
– For embedded processors, RISC makes sense: smaller, cheaper, less power
22
Designing an ISA
Important metrics
– Design cost to hardware and software
– Performance and other execution measurements
– Instructions and data
– Static measurements (code size)
24
Computer Architecture
3. Performance
1
How to Define Performance?
§ There are many ways to define something as “the best”
§ Airplane Example
Airplane            Passenger   Cruising Range   Cruising Speed   Throughput
                    Capacity    (miles)          (m.p.h.)         (passengers × m.p.h.)
Boeing 777          375         5,256            610              228,750
Boeing 747          416         7,156            610              253,760
Airbus 380          525         8,200            560              294,000
BAC/Sud Concorde    132         4,000            1,350            178,200
Douglas DC-8-50     146         8,720            544              79,424
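Which airplane is "the best" depends on the column you rank by. A minimal Python sketch that recomputes throughput as capacity × cruising speed from the table values and picks a winner per metric (the dictionary layout and the names `airplanes` and `throughput` are illustrative, not from the slides):

```python
# name -> (passenger capacity, cruising range in miles, cruising speed in m.p.h.)
airplanes = {
    "Boeing 777": (375, 5256, 610),
    "Boeing 747": (416, 7156, 610),
    "Airbus 380": (525, 8200, 560),
    "BAC/Sud Concorde": (132, 4000, 1350),
    "Douglas DC-8-50": (146, 8720, 544),
}


def throughput(name):
    """Throughput in passengers * m.p.h. = capacity * cruising speed."""
    capacity, _cruising_range, speed = airplanes[name]
    return capacity * speed


# A different airplane wins depending on the chosen metric.
fastest = max(airplanes, key=lambda name: airplanes[name][2])
longest_range = max(airplanes, key=lambda name: airplanes[name][1])
highest_throughput = max(airplanes, key=throughput)

print("Fastest:", fastest)
print("Longest range:", longest_range)
print("Highest throughput:", highest_throughput, throughput(highest_throughput))
```

The same ambiguity applies to computers: a single number is only meaningful once the metric is fixed.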
2
Applying to Computers
§ How do you decide which computer is the best?
– Processor Speed
– Memory Size
– Storage Speed
– Graphic Card / Subsystem
– Energy Efficiency
– Price
– Weight
3
Example of Metrics in Computers
[Figure: system layers (application program, compiler, operating system, ISA, functional units, logic / transistors) annotated with metrics: answers per month, operations per second, (millions of) instructions per second (MIPS), (giga) floating-point operations per second (GFLOPS), clock frequency (GHz)]
5
Example of Performance Comparison
§ X and Y do their homework
– X takes 5 hours
– Y takes 10 hours
$n = \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{0.2}{0.1} = 2$
[Figure: clock signal over time, with one clock cycle time marked]
§ Clock Rate
– Inverse of clock cycle time (= 1 / clock cycle time)
– # of times to visit positive (negative) state per second
– Unit: Hz or MHz or GHz
7
Example of Clock Cycle Time
§ A Machine is running at 100MHz
– Clock rate = 100 MHz = 100 × 10^6 cycles / sec
– Positive state visited 10^8 times in 1 sec
– Clock cycle time = 1 / (100 × 10^6) sec = 10 ns
[Figure: clock waveform over time with a 10 ns cycle]
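The relation between clock rate and clock cycle time can be checked with a few lines of Python. This is a minimal sketch; the function name `clock_cycle_time_ns` is illustrative.

```python
def clock_cycle_time_ns(clock_rate_hz):
    """Clock cycle time is the inverse of the clock rate; returned in nanoseconds."""
    return 1e9 / clock_rate_hz


clock_rate_hz = 100e6                      # 100 MHz = 100 * 10^6 cycles / sec
cycle_ns = clock_cycle_time_ns(clock_rate_hz)
cycles_in_one_second = clock_rate_hz * 1   # positive state visited once per cycle

print(cycle_ns, "ns per cycle")
print(cycles_in_one_second, "cycles in 1 sec")
```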
8
How is clock cycle time determined?
§ Closely related to logic design
[Figure: two storage elements and the clock signal]
10
Measuring clock cycles
§ CPU clock cycles / program is not so intuitive.
– It is very difficult to estimate # of CPU clock cycles for your program.
11
Von Neumann Architecture
§ By John von Neumann, 1945
[Figure: CPU (processing unit) and Memory (storage unit), connected by address, data, and instruction paths]
14
Example Solution
§ Solve CPU time for each machine
– Execution timeA = I (= # of Instructions) * 2.0 * 1 ns = 2.0 * I ns
– Execution timeB = I (= # of Instructions) * 1.2 * 2 ns = 2.4 * I ns
§ Compare performance
$\frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{2.4 \times I \text{ ns}}{2.0 \times I \text{ ns}} = 1.2$
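The solution above follows CPU time = instruction count × CPI × clock cycle time. A minimal Python sketch of the same arithmetic, using the CPI and cycle-time values from the two execution-time lines. The instruction count is a hypothetical placeholder, since it cancels in the ratio.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


I = 1_000_000                         # hypothetical instruction count; cancels out below
time_a = cpu_time_ns(I, 2.0, 1.0)     # machine A: CPI 2.0, 1 ns cycle
time_b = cpu_time_ns(I, 1.2, 2.0)     # machine B: CPI 1.2, 2 ns cycle

# Performance is the inverse of execution time, so Perf_A / Perf_B = time_B / time_A.
performance_ratio = time_b / time_a
print(round(performance_ratio, 2))
```

Machine A comes out faster even though its CPI is higher, because its cycle time is half that of machine B.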
16
CPI from Instruction Mix
§ $CPI = \sum_{i=1}^{n} (CPI_i \times C_i)$
§ CPI Example
Instruction Class   Frequency   CPI_i
ALU operations      43%         1
Loads               21%         2
Stores              12%         2
Branches            24%         2
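The table can be turned into an overall CPI by weighting each class CPI by its frequency. A minimal Python sketch, assuming the weight C_i in the formula is the relative frequency of class i (the slide does not define it explicitly); the list name `mix` is illustrative.

```python
# (instruction class, frequency as a fraction, CPI of the class)
mix = [
    ("ALU operations", 0.43, 1),
    ("Loads", 0.21, 2),
    ("Stores", 0.12, 2),
    ("Branches", 0.24, 2),
]

# Assumption: C_i is the fraction of executed instructions in class i.
total_frequency = sum(freq for _name, freq, _cpi in mix)
overall_cpi = sum(cpi * freq for _name, freq, cpi in mix)

print(round(overall_cpi, 2))
```

With these frequencies the weighted sum is 0.43 + 0.42 + 0.24 + 0.48 = 1.57 cycles per instruction.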
[Figure: compiler, operating system, ISA, instruction mix, data path / control, logic / transistors, clock rate]
18
Aspects of CPU Performance
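$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
               Instruction count   CPI   Clock rate
Program        ✓
Compiler       ✓
ISA            ✓                   ✓
Organization                       ✓     ✓
Technology                               ✓
– Not just about the program algorithms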
19
Another Popular Performance Metrics
§ MIPS (million instructions per second)
– $\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^6}$
– Problems
– It does not take into account the instruction set.
20
A Wrong Use Case of MIPS
§ Consider a 500 MHz Machine
Class CPI
Class A 1
Class B 2
Class C 3
§ Execution Time
– Execution time = (Instruction count x CPI) / Clock rate
= Clock cycles / Clock rate
– Executioncomp1 = 10M / 500M = 0.02 s
– Executioncomp2 = 15M / 500M = 0.03 s
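The pitfall can be reproduced numerically. The per-class instruction counts are not given in the text above, so the counts in this Python sketch are assumptions chosen only to match the 10M and 15M clock cycles shown; the names `counts`, `execution_time`, and `mips` are illustrative.

```python
CLOCK_RATE_HZ = 500e6                    # 500 MHz machine
CPI = {"A": 1, "B": 2, "C": 3}

# Assumed instruction counts per class (hypothetical; they yield 10M and 15M cycles).
counts = {
    "comp1": {"A": 5e6, "B": 1e6, "C": 1e6},
    "comp2": {"A": 10e6, "B": 1e6, "C": 1e6},
}


def clock_cycles(mix):
    return sum(CPI[cls] * n for cls, n in mix.items())


def execution_time(mix):
    """Execution time = clock cycles / clock rate."""
    return clock_cycles(mix) / CLOCK_RATE_HZ


def mips(mix):
    """MIPS = instruction count / (execution time * 10^6)."""
    return sum(mix.values()) / (execution_time(mix) * 1e6)


for name, mix in counts.items():
    print(name, execution_time(mix), "s,", round(mips(mix)), "MIPS")
```

Under these assumed counts, the second code sequence has the higher MIPS rating yet the longer execution time, which is why MIPS alone is a misleading metric.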
23
Benchmarks
§ Users often want a performance metric.
§ A benchmark is a distillation of the attributes of a workload.
– Real applications usually work best, but using them is not always feasible.
§ Desirable attributes
– Relevant: meaningful within the target domain
– Understandable
– Good metric(s)
– Scalable
– Coverage: does not oversimplify important factors in the target domain
– Acceptance: vendors and users embrace it
24
Benchmarks (cont’d)
§ De-facto industry standard benchmarks for CPU
– SPEC
§ SPEC
– Standard Performance Evaluation Corporation
– Founded in 1988 by EE Times, Sun, HP, MIPS, Apollo, DEC
– Several different SPEC benchmarks
– Most include a suite of several different applications (such as integer and
floating point components, often reported separately)
– For more information, visit http://www.spec.org
25
Amdahl’s Law
Execution speedup is proportional to the size of the
improvement and the amount affected
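§ Execution time after improvement
$= \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$
§ Or
$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$
$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}}$
§ Example
$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{0.95 \times \text{ExTime}_{old}} = 1.053$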
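Amdahl's Law is easy to evaluate directly. A minimal Python sketch; the function name `amdahl_speedup` is illustrative, and the inputs (10% of execution time enhanced, sped up by 2×) are an assumption chosen because they give the 0.95 × ExTime_old figure shown above.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = ExTime_old / ExTime_new, with ExTime_old normalized to 1."""
    new_time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1 / new_time


# Assumed inputs: 10% of the time is affected and becomes 2x faster -> new time = 0.95.
overall = amdahl_speedup(0.10, 2)
print(round(overall, 3))

# Even an enormous speedup of that 10% is bounded by 1 / (1 - 0.10).
upper_bound = amdahl_speedup(0.10, 1e12)
print(round(upper_bound, 3))
```

The second call shows the practical message of the law: the unaffected fraction caps the overall speedup no matter how large the local improvement is.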
27
Example – Amdahl’s Law
§ $CPI = \sum_{i=1}^{n} (CPI_i \times C_i)$
§ CPI Example
Instruction Class   Frequency   CPI_i
ALU operations      43%         1
Loads               21%         2
Stores              12%         2
Branches            24%         2
1
Computer Architecture
1. Introduction
2
A Variety of Computer Systems
3
Classes of Computers
Personal Computers
– General purpose, running a variety of software
– Subject to cost/performance + energy tradeoff
Super Computers
– High-end scientific and engineering calculations
– Highest capability but accounting for a small fraction of the overall
computer market
4
Opening the Box (iMac Pro)
5
Components of a Computer
CPU: Control + Datapath
Memory Focus of This Course
I/Os
– GPUs
– User-Interface Devices:
Display, Keyboard, Mouse, Sound
– Storage Devices:
HDD, SSD
– Network Adapters:
Ethernet, 3G/4G/5G, Wi-Fi, Bluetooth
6
Von Neumann Architecture
By John von Neumann, 1945
[Figure: CPU (processing unit) and Memory (storage unit), connected by address, data, and instruction paths]
– Contains the arithmetic and control logic circuitry to perform the operations of
computer programs
8
Intel x86 History
[Figure: timeline of Intel processors: 4-bit, 8-bit, 16-bit, 32-bit (i386), then 1st Gen. Core i7 (Nehalem, 2009), 2nd Gen. Core i7 (Sandy Bridge, 2011), 2012, 2013]
9
Computer Ads.
10
Moore’s Law
The number of transistors incorporated in a chip will approximately double every 24 months
(Gordon Moore, Intel Co-founder, 1965)
13
Architecture: Exploiting Parallelism
Instruction Level Parallelism (ILP)
– Pipelining
– Superscalar
– Out-of-order Execution
– Branch Prediction
– VLIW (Very Long Instruction Word)
14
Shift to Multicore
ILP Wall
– Control Dependency
– Data Dependency
Memory Wall
– Memory latency improved by 10% / year
– Cache shows diminishing returns
Power Wall
15
Intel’s Core 2 Duo
2 Cores on one chip
Two levels of caches (L1, L2) on chip
291 million transistors in 143 mm^2 with 65nm technology
[Figure: Core0 and Core1, each with IL1 and DL1 caches, and an L2 cache]
Source: http://www.sandpile.org
16
Intel’s Core i7
4 Cores on one chip
Three levels of caches (L1, L2, L3) on chip
731 million transistors in 263 mm^2 with 45nm technology
17
Intel’s Core i7 (3rd Gen.)
3rd Generation
Core i7
L1 64 KB
L2 256 KB
L3 8MB
18
How to Run Your Program?
19
Instruction Set Architecture (ISA)
The Hardware/Software Interface
– Hardware abstraction visible to software (OS, compilers)
– Instructions and their encodings, registers, data types, addressing modes, etc.
– Written documents about how the CPU behaves
20
A Case for Robot Cleaner
Commands (or Instructions)
– Go one step forward 00
– Go one step backward 01
– Turn left 10
– Turn right 11
A Simple Program
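A program for the robot cleaner is just a sequence of the 2-bit command codes listed above. A minimal Python sketch that decodes and runs such a bit string; the grid coordinates, the starting direction, and the 90-degree turns are assumptions added for illustration, and the example bit string is hypothetical.

```python
# 2-bit encodings from the command list
COMMANDS = {"00": "forward", "01": "backward", "10": "turn left", "11": "turn right"}


def decode(program_bits):
    """Split the bit string into 2-bit codes and look up each command."""
    return [COMMANDS[program_bits[i:i + 2]] for i in range(0, len(program_bits), 2)]


def run(program_bits):
    """Execute the program on a grid; assumes the robot starts at (0, 0) facing +y."""
    x, y = 0, 0
    dx, dy = 0, 1
    for command in decode(program_bits):
        if command == "forward":
            x, y = x + dx, y + dy
        elif command == "backward":
            x, y = x - dx, y - dy
        elif command == "turn left":      # assumed 90 degrees counterclockwise
            dx, dy = -dy, dx
        else:                             # "turn right": assumed 90 degrees clockwise
            dx, dy = dy, -dx
    return x, y


program = "00001100"                      # forward, forward, turn right, forward
print(decode(program))
print(run(program))
```

The robot (hardware) only needs to understand four bit patterns, while the program (software) only needs to know what each pattern means: that agreement is the instruction set.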
21
Von Neumann Architecture
By John von Neumann, 1945
[Figure: CPU (processing unit) and Memory (storage unit), connected by address, data, and instruction paths]
23
Abstraction is Good But
Abstraction helps us deal with complexity
– Hide lower-level details
24
Topic 1: How to Design Interface?
Choices critically affect both the software programmer and hardware
designer
Example: Copying n bytes from address A to B
26
Topic 3: What About the Memory?
It’s too slow!
27
Expectations
Hopefully, you will have a lot of fun throughout this class!
– It will take substantial time to finish your project, though.
28
Computer Architecture
4. The Processor: Datapath and Control
1
Computer Organization
Computer
Processor Memory Devices
Control Input
Datapath Output
2
Big Picture: Processor Implementation
§ Key ideas
– Concept of datapath and control
– Where the instruction and data bits go
§ Approach
– Start with a simple implementation and iteratively improve it
3
Subset of Instructions
§ To simplify our study of processor design, we will focus on
a simple subset of RV32I that still covers most aspects
– Data transfer: lw, sw
– Arithmetic/logical: add, sub, and, or
– Control transfer: beq
4
RISC-V Format Review
5
Execution Cycle
§ The lifecycle of an instruction
– Instruction Fetch: Obtain instruction from instruction memory
– Instruction Decode: Determine required actions
– Operand Fetch: Locate and obtain operand data
– Result Store: Deposit results in storage for later use
– Next Instruction: Determine the following instruction
6
Implementation Overview
§ Data flows through memory and functional units
[Figure: datapath overview with PC, instruction memory (address, instruction), registers (register #), ALU, and data memory (address, data)]
7
Digital Systems
§ Three components required to implement a digital system
– Combinational elements
– Output is dependent only on current inputs
– E.g., ALU
– Sequential elements
– Element contains state information (i.e., memory element)
– E.g., registers
– Clock signals regulate the updating of the memory elements
8
Combinational Elements
§ Acyclic network of logic gates
– Continuously responds to changes on primary inputs
– Primary outputs become (after some delay) functions of primary inputs
9
Combinational Elements: Examples
§ AND-gate § Adder
– Y=A&B – Y=A+B
10
Storage Elements: Register
§ Register
– Based on the D Flip Flops
– N-bit input and output
– Write enable input
– Write Enable:
– 0: Data Out will not change
– 1: Data Out will become Data in
11
Storage Elements: Register File
§ Register File consists of 32 registers
– Two 32-bit output busses:
– Read data1 and Read data2
– One 32-bit input bus: Write data
– x0 hard-wired to value 0
CLK
§ Register is selected by:
– Read register1 selects the register to put on Read data1
– Read register2 selects the register to put on Read data2
– Write register selects the register to be written with Write data when
RegWrite = 1
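The register file behaviour described above can be modelled in a few lines. This is a minimal Python sketch, not a hardware description: the class name `RegisterFile` and the method `clock_edge` are illustrative, and modelling the clocked write as a method call is an assumption.

```python
class RegisterFile:
    """32 registers of 32 bits, two read ports, one write port, x0 hard-wired to 0."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_register1, read_register2):
        """Combinational read: returns (Read data1, Read data2)."""
        return self.regs[read_register1], self.regs[read_register2]

    def clock_edge(self, write_register, write_data, reg_write):
        """Assumed clocked write: updates state only when RegWrite = 1."""
        if reg_write == 1 and write_register != 0:   # writes to x0 are ignored
            self.regs[write_register] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write_register=5, write_data=42, reg_write=1)   # x5 <- 42
rf.clock_edge(write_register=6, write_data=9, reg_write=0)    # RegWrite = 0: no change
rf.clock_edge(write_register=0, write_data=7, reg_write=1)    # x0 stays 0
print(rf.read(5, 6), rf.read(0, 0))
```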
§ Address
– Selects the word to put on Data Out when MemRead = 1
– The word to be written via the Data In when MemWrite = 1
13
Instruction Fetch (IF) RTL
§ Common RTL operations
– Fetch instruction
– Mem[PC]; // fetch instruction from instruction memory
– Update program counter
– PC <- PC + 4; // calculate next address
14
Datapath: IF Unit
15
Add RTL
§ Add instruction
– add rd, rs1, rs2
– Mem[PC]; // fetch instruction from instruction memory
– R[rd] <- R[rs1] + R[rs2] // ADD instruction
– PC <- PC + 4; // calculate next address
16
Datapath: Reg/Reg Operations
§ R[rd] <- R[rs1] op R[rs2];
– ALU control and RegWrite based on decoded instruction
– Read register1, read register2, and write register from rs1, rs2, rd fields
17
OR Immediate RTL
§ OR immediate instruction
– ori rd, rs1, imm12
– Mem[PC]; // fetch instruction from instruction memory
– R[rd] <- R[rs1] OR SignExt[imm12] // OR operation with Sign-Extension
– PC <- PC + 4; // calculate next address
– Encoding: funct3 = 110, opcode = 0010011
18
Datapath: Immediate Operations
§ 1 Mux and Immediate Generation Unit are added
– Mux: selects data from register if R-format by ALUSrc
[Figure: 2-to-1 mux controlled by ALUSrc, selecting between register data (input 0) and the Imm Gen output for imm12 (input 1)]
19
Load RTL
§ Load instruction
– lw rd, imm12(rs1)
– Mem[PC]; // fetch instruction from instruction memory
– Addr <- R[rs1] + SignExt(imm12); // Compute memory address
– R[rd] <- Mem[Addr]; // Load data into register
– PC <- PC + 4; // calculate next address
– Encoding: funct3 = 010, opcode = 0000011
20
Datapath: Load
§ Sign extension logic is added
– Offset can be either positive or negative
[Figure: 2-to-1 mux controlled by ALUSrc, selecting between register data (input 0) and the Imm Gen output (input 1)]
21
Store RTL
§ Store instruction
– sw rs2, imm12(rs1)
– Mem[PC]; // fetch instruction from instruction memory
– Addr <- R[rs1] + SignExt(imm12); // Compute memory address
– Mem[Addr] <- R[rs2]; // Store data into memory
– PC <- PC + 4; // calculate next address
– Encoding: funct3 = 010, opcode = 0100011
22
Datapath: R-Type/Load/Store
§ A path from register to memory has been created
§ 1 Mux is added to select data to be stored in registers
23
Branch RTL
§ Branch instruction
– beq rs1, rs2, imm12
– Mem[PC]; // fetch instruction from instruction memory
– Zero <- ((R[rs1] – R[rs2]) == 0); // Use ALU, subtract and check Zero output
– If (Zero == 1) then
– PC <- PC + (SignExt(imm12) << 1); // Branch if equal
– else
– PC <- PC + 4; // Keep going otherwise
– Encoding: funct3 = 000, opcode = 1100011
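The RTL descriptions for add, ori, lw, sw, and beq can be exercised with a small interpreter. This is a minimal Python sketch under stated assumptions: instructions are represented as dictionaries instead of 32-bit encodings, memory is a dictionary keyed by byte address, and the names `step`, `write_reg`, and `sign_ext12` are illustrative.

```python
MASK32 = 0xFFFFFFFF  # RV32I registers are 32 bits wide


def sign_ext12(imm12):
    """Sign-extend a 12-bit immediate to a Python int."""
    imm12 &= 0xFFF
    return imm12 - 0x1000 if imm12 & 0x800 else imm12


def write_reg(R, rd, value):
    """R[rd] <- value, keeping x0 hard-wired to 0."""
    if rd != 0:
        R[rd] = value & MASK32


def step(state, instr):
    """Apply the RTL of one instruction (hypothetical dict format) to the state."""
    R, Mem, PC = state["R"], state["Mem"], state["PC"]
    op = instr["op"]
    next_pc = PC + 4                                        # PC <- PC + 4
    if op == "add":
        write_reg(R, instr["rd"], R[instr["rs1"]] + R[instr["rs2"]])
    elif op == "ori":
        write_reg(R, instr["rd"], R[instr["rs1"]] | sign_ext12(instr["imm12"]))
    elif op == "lw":
        addr = (R[instr["rs1"]] + sign_ext12(instr["imm12"])) & MASK32
        write_reg(R, instr["rd"], Mem.get(addr, 0))         # R[rd] <- Mem[Addr]
    elif op == "sw":
        addr = (R[instr["rs1"]] + sign_ext12(instr["imm12"])) & MASK32
        Mem[addr] = R[instr["rs2"]]                         # Mem[Addr] <- R[rs2]
    elif op == "beq":
        zero = 1 if ((R[instr["rs1"]] - R[instr["rs2"]]) & MASK32) == 0 else 0
        if zero == 1:
            next_pc = PC + (sign_ext12(instr["imm12"]) << 1)
    else:
        raise ValueError("unsupported instruction: " + op)
    state["PC"] = next_pc & MASK32


# Instruction memory keyed by address; data memory (Mem) is kept separate.
program = {
    0: {"op": "add", "rd": 3, "rs1": 1, "rs2": 2},
    4: {"op": "ori", "rd": 4, "rs1": 3, "imm12": 1},
    8: {"op": "sw", "rs1": 0, "rs2": 4, "imm12": 8},
    12: {"op": "lw", "rd": 5, "rs1": 0, "imm12": 8},
    16: {"op": "beq", "rs1": 4, "rs2": 5, "imm12": 4},      # taken: PC <- 16 + (4 << 1)
    20: {"op": "add", "rd": 6, "rs1": 1, "rs2": 1},         # skipped by the branch
}
state = {"R": [0] * 32, "Mem": {}, "PC": 0}
state["R"][1], state["R"][2] = 5, 7
while state["PC"] in program:
    step(state, program[state["PC"]])                       # Mem[PC]: fetch, then execute
print(state["R"][3:7], state["Mem"], state["PC"])
```

Every instruction shares the fetch and the default PC + 4 update; only the middle of `step` differs, which is the observation the datapath slides build on.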
24
Datapath: Branch
25
Putting It All Together
26
Adding Control
§ Design Steps
– Identify control points for pieces of the datapath
27
Single-Cycle Datapath
[Figure: single-cycle datapath (PCSrc mux, PC + 4 adder, branch-target adder with shift-left unit, RegWrite, ALUOp)]
29
R-Type Instruction Dataflow
§ For add, sub, and, or instructions
[Figure: single-cycle datapath (PCSrc mux, PC + 4 adder, branch-target adder with shift-left unit, RegWrite, ALUOp)]
30
R-Type Instruction Control
31
I-Type Load Instruction Dataflow
§ For lw instruction
[Figure: single-cycle datapath (PCSrc mux, PC + 4 adder, branch-target adder with shift-left unit, RegWrite, ALUOp)]
32
I-Type Load Instruction Control
33
S-Type Store Instruction Dataflow
§ For sw instruction
[Figure: single-cycle datapath (PCSrc mux, PC + 4 adder, branch-target adder with shift-left unit, RegWrite, ALUOp)]
34
S-Type Store Instruction Control
35
SB-Type Branch Instruction Dataflow
§ For beq instruction
[Figure: single-cycle datapath (PCSrc mux, PC + 4 adder, branch-target adder with shift-left unit, RegWrite, ALUOp)]
36
SB-Type Branch Instruction Control
37
ALU Control
§ ALU used for
– R-type: F depends on opcode
– Load/Store: F = add
– Branch: F = subtract
38
ALU Control (Cont’d)
§ Assume 2-bit ALUOp derived from opcode
– Combinational logic derives ALU control
39
Main Control Unit
§ Control signals derived from instruction
40
Put It All Together
41
Single Cycle Processor
§ Advantages
– Single cycle per instruction makes logic and clock simple
§ Disadvantages
– Cycle time is determined by the worst-case path
– Critical path: load instruction
– Instruction memory -> register file -> ALU -> data memory -> register file
– Not feasible to adapt to different instructions
– Different instructions can take different lengths of time
– Inefficient utilization of memory and functional units
1
– Register Fetch: Locate and obtain operand data
– Memory Access: Load and/or store values from/to memory
– Write Back: Deposit results in storage for later use
– Next Instruction: Determine the following instruction
4
Multicycle Divisions (Cont’d)
§ From datapath point of view
[Figure: datapath (PC, instruction memory, registers, ALU, data memory) divided into five stages]
– IF: Instruction Fetch
– RF: Register Fetch
– EX: Execution
– MEM: Memory
– WB: Write Back
5
Multicycle Implementation Overview
[Figure: multicycle datapath with PC, Memory (Address, Write data, MemData), Instruction Register (Instruction [31-0], [24-20], [19-15], [11-7]), Memory Data Register, Registers (Read register1, Read register2, Write register, Write data, Read data 1, Read data 2), registers A and B, Imm Gen (32-bit), ALU (Zero, ALU result), ALUOut, and multiplexers]
7
Stage Description of Multicycle Design
Step Name                           | Action for R-type       | Action for Memory                 | Action for
                                    | Instructions            | Instructions                      | Branches
Instruction Fetch                   | IR = Memory[PC]; PC = PC + 4 (all instruction classes)
Instruction Decode/Register Fetch   | A = Reg[IR[19-15]]; B = Reg[IR[24-20]];
                                    | ALUOut = PC + Branch Label (i.e., sign-extend(imm12 << 1)) (all instruction classes)
Execution, Address Computation,     | ALUOut = A op B         | ALUOut = A + sign-extend(imm12)   | If (A==B) then
Branch Completion                   |                         |                                   | PC = ALUOut
Memory Access, R-Type Completion    | Reg[IR[11-7]] = ALUOut  | Load: MDR = Memory[ALUOut] or     |
                                    |                         | Store: Memory[ALUOut] = B         |
Load Completion                     |                         | Load: Reg[IR[11-7]] = MDR         |
Step1: Instruction Fetch
§ RTL Description
IR = Memory[PC]
PC = PC + 4
Step2: Instruction Decode and Register Fetch
§ RTL Description
A = Reg[IR[19-15]]
B = Reg[IR[24-20]]
ALUOut = PC + Branch Label (i.e., sign-extend (imm12 << 1))
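The register-fetch step can be sketched as plain bit-field extraction. This is an illustrative sketch, not the lecture's implementation: the helper names `bits`, `sign_extend`, and `decode_step` are hypothetical, and `imm12` is assumed to be supplied already extracted by the Imm Gen unit.

```python
def bits(word, hi, lo):
    """Return bits [hi:lo] of a 32-bit word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


def decode_step(ir, pc, regs, imm12):
    """Step 2: read both source registers and precompute the branch target."""
    a = regs[bits(ir, 19, 15)]                  # A = Reg[IR[19-15]]
    b = regs[bits(ir, 24, 20)]                  # B = Reg[IR[24-20]]
    alu_out = pc + sign_extend(imm12 << 1, 13)  # ALUOut = PC + sign-extend(imm12 << 1)
    return a, b, alu_out


if __name__ == "__main__":
    regs = [0] * 32
    regs[1], regs[2] = 7, 9
    add_x3_x1_x2 = 0x002081B3  # RISC-V encoding of add x3, x1, x2
    print(decode_step(add_x3_x1_x2, 100, regs, 0xFFE))
```

The branch target is computed even though `add` is not a branch, because the instruction type is not yet known at this step.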
Step2 Special Note
§ Control Lines
– Not dependent on instruction type
– Instruction is still being decoded at this step
– The ALU computes the branch target speculatively, in case the instruction turns out to be a branch
Step3: R-Type Execution
§ RTL Description
ALUOut = A op B
Step4: R-Type Completion Step
§ RTL Description
Reg[IR[11-7]] = ALUOut
Step3: Branch Completion Step
§ RTL Description
If (A == B) then PC = ALUOut
Step3: Memory Execution
§ RTL Description
ALUOut = A + sign-extend (imm12)
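A short sketch of the address computation for loads and stores. The function names are hypothetical, and wrapping the result to 32 bits is an assumption about the datapath width rather than something stated on the slide.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


def effective_address(a, imm12):
    """ALUOut = A + sign-extend(imm12), wrapped to 32 bits (assumed width)."""
    return (a + sign_extend(imm12, 12)) & 0xFFFFFFFF


if __name__ == "__main__":
    base = 0x1000
    print(hex(effective_address(base, 0x008)))  # offset +8
    print(hex(effective_address(base, 0xFF8)))  # offset -8 as a 12-bit value
```

Sign extension matters because a 12-bit immediate such as 0xFF8 encodes a negative offset; adding it unextended would move the address forward by 4088 bytes instead of back by 8.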
Step4: Load Memory Access Step
§ RTL Description
MDR = Memory[ALUOut]
Step5: Load Completion Step
§ RTL Description
Reg[IR[11-7]] = MDR
Step4: Store Memory Access Step
§ RTL Description
Memory[ALUOut] = B
Summary of Multicycle Steps
Step Name | Action for R-type Instructions | Action for Memory Instructions | Action for Branches
----------|--------------------------------|--------------------------------|--------------------
Instruction Fetch | IR = Memory[PC]; PC = PC + 4 | same | same
Instruction Decode / Register Fetch | A = Reg[IR[19-15]]; B = Reg[IR[24-20]]; ALUOut = PC + Branch Label (i.e., sign-extend(imm12 << 1)) | same | same
Execution, Address Computation, Branch Completion | ALUOut = A op B | ALUOut = A + sign-extend(imm12) | If (A == B) then PC = ALUOut
Memory Access, R-Type Completion | Reg[IR[11-7]] = ALUOut | Load: MDR = Memory[ALUOut] or Store: Memory[ALUOut] = B | (none)
Load Completion | (none) | Load: Reg[IR[11-7]] = MDR | (none)
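The steps summarized above can be sketched as a small interpreter that charges one clock cycle per step. This is a simplified illustration under several assumptions: instructions are symbolic dictionaries rather than encoded words, the `run` function and its field names are hypothetical, data memory is a Python dict keyed by byte address, and the branch target follows the RTL literally, so the offset is added to the already-incremented PC.

```python
def run(program, regs, data_mem, max_cycles=1000):
    """Execute symbolic instructions step by step; return total clock cycles."""
    pc = 0
    cycles = 0
    while 0 <= pc < 4 * len(program) and cycles < max_cycles:
        # Step 1: Instruction Fetch
        ir = program[pc // 4]
        pc = pc + 4
        # Step 2: Instruction Decode / Register Fetch
        a = regs[ir.get("rs1", 0)]
        b = regs[ir.get("rs2", 0)]
        alu_out = pc + ir.get("imm", 0)  # speculative branch target
        cycles += 2
        op = ir["op"]
        if op == "beq":
            # Step 3: Branch Completion
            if a == b:
                pc = alu_out
            cycles += 1
        elif op in ("add", "sub"):
            # Step 3: Execution, Step 4: R-Type Completion
            alu_out = a + b if op == "add" else a - b
            regs[ir["rd"]] = alu_out
            cycles += 2
        elif op == "lw":
            alu_out = a + ir["imm"]   # Step 3: Address Computation
            mdr = data_mem[alu_out]   # Step 4: Memory Access
            regs[ir["rd"]] = mdr      # Step 5: Load Completion
            cycles += 3
        elif op == "sw":
            alu_out = a + ir["imm"]   # Step 3: Address Computation
            data_mem[alu_out] = b     # Step 4: Memory Access
            cycles += 2
        else:
            raise ValueError("unsupported op: " + op)
    return cycles


if __name__ == "__main__":
    regs = [0] * 32
    regs[1] = 8
    data_mem = {8: 5}
    program = [
        {"op": "lw", "rd": 2, "rs1": 1, "imm": 0},
        {"op": "add", "rd": 3, "rs1": 2, "rs2": 2},
        {"op": "sw", "rs1": 1, "rs2": 3, "imm": 4},
        {"op": "beq", "rs1": 0, "rs2": 0, "imm": 0},
    ]
    print(run(program, regs, data_mem), regs[3], data_mem[12])
```

The four instructions in the usage example take 5, 4, 4, and 3 cycles respectively, matching the number of rows each instruction class occupies in the table.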
CPI of the Multicycle Implementation
§ Number of clock cycles
– Loads: 5
– Stores: 4
– R-format instructions: 4
– Branches (& Jumps): 3
§ Instruction mix
– 22% Loads, 11% Stores, 49% R-format instructions, 16% Branches, and 2% Jumps
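The cycle counts and instruction mix above combine into an average CPI as a weighted sum. A minimal sketch, with the dictionary and function names chosen here for illustration:

```python
# Clock cycles per instruction class, from the slide.
CYCLES = {"load": 5, "store": 4, "r_format": 4, "branch": 3, "jump": 3}

# Instruction mix in percent, from the slide.
MIX_PERCENT = {"load": 22, "store": 11, "r_format": 49, "branch": 16, "jump": 2}


def average_cpi(cycles, mix_percent):
    """Weighted average of cycles per instruction over the instruction mix."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("instruction mix must sum to 100%")
    return sum(mix_percent[k] * cycles[k] for k in mix_percent) / 100


if __name__ == "__main__":
    # 0.22*5 + 0.11*4 + 0.49*4 + 0.16*3 + 0.02*3
    print(f"{average_cpi(CYCLES, MIX_PERCENT):.2f}")
```

With this mix the average works out to 4.04 cycles per instruction, below the 5 cycles that would result if every instruction took as long as a load.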
§ Disadvantages
– Requires additional registers to hold intermediate values between steps
– More timing paths to design, analyze, and tune
Summary
§ Processor design requires refinement of datapath and control
§ Multicycle Implementation
– Partition execution into small steps of comparable duration