Performance of Computer Systems
Performance
• Measure, Report, and Summarize
• Make intelligent choices
• See through the marketing hype
• Key to understanding underlying organizational motivation
Why is some hardware better than others for different programs?
What factors of system performance are hardware related?
(e.g., Do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?
Which of these airplanes has the best performance?
Definition of Performance
• For some program running on machine X,
Performance_X = 1 / Execution time_X
• "X is n times faster than Y"
n = Performance_X / Performance_Y
• Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
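A minimal sketch of the comparison this problem sets up, using the definition of performance given above. The variable and function names are illustrative, not part of the original slides.

```python
# Execution times from the problem statement.
time_a = 20.0  # seconds on machine A
time_b = 25.0  # seconds on machine B


def performance(execution_time):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time


# "A is n times faster than B": n = Performance_A / Performance_B
n = performance(time_a) / performance(time_b)
print(f"A is {n:.2f} times faster than B")
```

Because performance is the reciprocal of execution time, the same ratio can be read directly as 25 / 20.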
Clock Cycles
• Instead of reporting execution time in seconds, we often use cycles
seconds/program = (cycles/program) × (seconds/cycle)
So, to improve performance (everything else being equal) you can either (increase or decrease?)
[Figure: successive instructions (..., 4th, 5th, 6th, ...) along a time axis]
Why? Hint: remember that these are machine instructions, not lines of C code.
Different numbers of cycles for different instructions
• If two machines have the same ISA which of our quantities (e.g., clock
rate, CPI, execution time, # of instructions, MIPS) will always be
identical?
# of Instructions Example
• A compiler designer is trying to decide between two code
sequences for a particular machine. Based on the hardware
implementation, there are three different classes of
instructions: Class A, Class B, and Class C, and they
require one, two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of A, 1 of B,
and 2 of C
The second sequence has 6 instructions: 4 of A, 1 of B,
and 1 of C.
Which sequence will be faster? How much?
What is the CPI for each sequence?
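One way to work through these questions is to count the cycles of each sequence directly. The sketch below uses only the class costs and instruction counts stated in the example; the names are illustrative.

```python
# Cycles per instruction for each class, as given in the example.
CYCLES = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two code sequences.
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}


def total_cycles(seq):
    """Sum of (instruction count x cycles) over all classes."""
    return sum(count * CYCLES[cls] for cls, count in seq.items())


def cpi(seq):
    """Average cycles per instruction for the sequence."""
    return total_cycles(seq) / sum(seq.values())


for name, seq in (("sequence 1", seq1), ("sequence 2", seq2)):
    print(name, total_cycles(seq), "cycles, CPI =", cpi(seq))
```

Sequence 2 needs 9 cycles against 10 for sequence 1, so it is faster even though it executes more instructions.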
MIPS example
• Two different compilers are being tested for a 4 GHz machine with
three different classes of instructions: Class A, Class B, and Class C,
which require one, two, and three cycles (respectively). Both
compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 million Class A instructions, 1 million
Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1
million Class B instructions, and 1 million Class C instructions.
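A short sketch comparing the two compilers by MIPS rating and by execution time, assuming MIPS = instruction count / (execution time × 10^6). Only the figures stated in the example are used; the names are illustrative.

```python
CLOCK_RATE = 4e9  # 4 GHz, from the example
CYCLES = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class produced by each compiler.
compiler1 = {"A": 5e6, "B": 1e6, "C": 1e6}
compiler2 = {"A": 10e6, "B": 1e6, "C": 1e6}


def execution_time(counts):
    """Seconds = total clock cycles / clock rate."""
    cycles = sum(n * CYCLES[c] for c, n in counts.items())
    return cycles / CLOCK_RATE


def mips(counts):
    """Millions of instructions executed per second."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


for name, counts in (("compiler 1", compiler1), ("compiler 2", compiler2)):
    print(name, "time (s):", execution_time(counts), "MIPS:", mips(counts))
```

The second compiler's code earns the higher MIPS rating but takes longer to run, which is why MIPS alone can mislead.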
Sequential Execution of 3 LW Instructions
[Figure: lw instructions (lw r2, 200(r0); lw r3, 300(r0); ...) executed one after another, 8 ns each. Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg.]
CPU Time: Example 1
CPU Time: Example 1 (continued)
• a. Approach 1:
Clock cycles for a program = (x×3 + y×2 + z×4 + w×5)
= 910 × 10^6 clock cycles
CPU time = Clock cycles for a program / Clock rate
= (910 × 10^6) / (500 × 10^6) = 1.82 sec
• b. Approach 2:
CPI = Clock cycles for a program / Instruction count
CPI = (x×3 + y×2 + z×4 + w×5) / (x + y + z + w)
= 3.03 clock cycles/instruction
CPU time = Instruction count × CPI / Clock rate
= (x+y+z+w) × 3.03 / (500 × 10^6)
= (300 × 10^6 × 3.03) / (500 × 10^6)
= 1.82 sec
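The two approaches can be checked against each other numerically. This sketch uses only the totals quoted above (910 × 10^6 cycles, 300 × 10^6 instructions, 500 MHz); the individual counts x, y, z, w are not restated here, so they are not used. Variable names are illustrative.

```python
total_cycles = 910e6       # x*3 + y*2 + z*4 + w*5
instruction_count = 300e6  # x + y + z + w
clock_rate = 500e6         # 500 MHz

# Approach 1: total cycles divided by clock rate.
cpu_time_1 = total_cycles / clock_rate

# Approach 2: go through CPI first.
cpi = total_cycles / instruction_count
cpu_time_2 = instruction_count * cpi / clock_rate

print(f"CPI = {cpi:.2f}")
print(f"CPU time: {cpu_time_1:.2f} s (approach 1), {cpu_time_2:.2f} s (approach 2)")
```

Both routes give the same CPU time because CPI is defined as total cycles divided by instruction count.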
CPU Time: Example 2
Consider another implementation of the MIPS ISA with a 1 GHz clock
in which
– each ALU instruction takes 4 clock cycles,
– each branch/jump instruction takes 3 clock cycles,
– each sw instruction takes 5 clock cycles,
– each lw instruction takes 6 clock cycles.
Also, consider the same program as in Example 1.
Find CPI and CPU time.
CPI = (x×4 + y×3 + z×5 + w×6) / (x + y + z + w)
= 4.03 clock cycles/instruction
CPU time = Instruction count × CPI / Clock rate
= (x+y+z+w) × 4.03 / (1000 × 10^6)
= (300 × 10^6 × 4.03) / (1000 × 10^6)
= 1.21 sec
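A sketch of where 4.03 comes from without knowing x, y, z, w individually. It relies on an observation that is not stated on the slide: every instruction class in this implementation costs exactly one cycle more than in Example 1 (3→4, 2→3, 4→5, 5→6), so the total grows by one cycle per instruction. Variable names are illustrative.

```python
ex1_cycles = 910e6         # total cycles from Example 1
instruction_count = 300e6  # same program, same instruction count
ex2_clock_rate = 1e9       # 1 GHz

# One extra cycle per instruction compared with Example 1.
ex2_cycles = ex1_cycles + 1 * instruction_count

ex2_cpi = ex2_cycles / instruction_count
ex2_cpu_time = ex2_cycles / ex2_clock_rate

print(f"CPI = {ex2_cpi:.2f}, CPU time = {ex2_cpu_time:.2f} s")
```

The CPI is higher than in Example 1, yet the CPU time is lower because the clock rate doubled.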
Analysis of CPU Performance Equation
Calculating Components of CPU time
• For an existing processor it is easy to obtain the CPU time (i.e.,
the execution time) by measurement, and the clock rate is
known. However, it is difficult to determine the instruction count
or the CPI.
Attempting to Calculate CPI
The table below gives the frequency of each instruction type executed
in a “typical” program and, from the reference manual, the number of
cycles per instruction for each type.
Instruction type      Frequency   Cycles
ALU instruction       50%         4
Load instruction      30%         5
Store instruction     5%          4
Branch instruction    15%         2
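The overall CPI implied by the table is the frequency-weighted average of the per-type cycle counts. A minimal sketch, with illustrative names:

```python
# (frequency, cycles) per instruction type, copied from the table.
INSTRUCTION_MIX = {
    "ALU": (0.50, 4),
    "Load": (0.30, 5),
    "Store": (0.05, 4),
    "Branch": (0.15, 2),
}

# The frequencies should cover the whole program.
assert abs(sum(f for f, _ in INSTRUCTION_MIX.values()) - 1.0) < 1e-9

# Weighted average: sum of frequency x cycles over all types.
cpi = sum(freq * cycles for freq, cycles in INSTRUCTION_MIX.values())
print(f"CPI = {cpi:.2f}")
```

The terms are 2.0 + 1.5 + 0.2 + 0.3, so the estimate is 4 cycles per instruction.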
[Figure: sequential laundry. Loads A to D are processed one after another in task order; each load takes 30, 40, and 20 minutes for its three stages.]
Sequential laundry takes 6 hours for 4 loads.
If Dave learned pipelining, how long would laundry take?
Pipelined Laundry
[Figure: pipelined laundry on a timeline from 6 PM to midnight. Loads A to D overlap in task order; the segments along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]
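The laundry timings can be reproduced with a small simulation. This is a sketch under stated assumptions: three stages of 30, 40, and 20 minutes per load, four loads, and a load may wait between stages until the next stage is free. The function names are illustrative.

```python
STAGE_MINUTES = [30, 40, 20]  # per-load stage times read off the figure
LOADS = 4


def sequential_minutes(stages, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """A load enters a stage once it is ready and the stage is free."""
    finish = [0] * len(stages)  # when each stage finished its latest load
    for _ in range(loads):
        ready = 0  # when the current load leaves the previous stage
        for s, minutes in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


print(sequential_minutes(STAGE_MINUTES, LOADS) / 60, "hours sequential")
print(pipelined_minutes(STAGE_MINUTES, LOADS) / 60, "hours pipelined")
```

The slowest stage (40 minutes) sets the pace: the pipelined total is 30 + 4 × 40 + 20 = 210 minutes, or 3.5 hours, against 6 hours sequentially.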
[Figure: pipelined execution of lw r1, 100(r0); lw r2, 200(r0); lw r3, 300(r0), listed in program execution order (in instructions). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg, and a new instruction starts every 2 ns.]
Note that registers are written during the first part of a cycle and
read during the second part of the same cycle.
• Pipelining does not make a single instruction execute faster; it may
improve performance by increasing instruction throughput.
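A back-of-the-envelope sketch of the throughput gain for the three lw instructions. It assumes the values that appear in the two figures: 8 ns per instruction when executed sequentially, and five pipeline stages of 2 ns each. The names are illustrative.

```python
INSTRUCTIONS = 3
SEQUENTIAL_NS = 8  # per lw, from the sequential figure
STAGES = 5         # Instruction fetch, Reg, ALU, Data access, Reg
STAGE_NS = 2       # pipeline cycle time, from the pipelined figure

# Without pipelining, instructions run back to back.
sequential_total = INSTRUCTIONS * SEQUENTIAL_NS

# With pipelining, the first instruction needs all stages and each
# later instruction completes one cycle after the previous one.
pipelined_total = (STAGES + INSTRUCTIONS - 1) * STAGE_NS

# Latency of one instruction in the pipeline.
single_latency = STAGES * STAGE_NS

print(sequential_total, pipelined_total, single_latency)
```

Under these assumptions a single lw takes 10 ns in the pipeline, which is no better than 8 ns, but three of them finish in 14 ns instead of 24 ns.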
Quantitative Performance Measures
• The original performance measure was the time to perform an individual
instruction, e.g. add. Because instructions took the same time, this
measure was appropriate.
• The next performance measure was the average instruction time,
obtained from the relative frequency of instructions in some typical
instruction mix and the time to execute each instruction. Since
instruction sets were similar, this was a more accurate comparison.
• One alternative to execution time as the metric was MIPS – Million
Instructions Per Second. For a given program, the MIPS rating is simple:
MIPS = Instruction count / (Execution time × 10^6)
Benchmark Suites
It has become popular to put together collections of benchmarks
to try to measure the performance of processors.
Benchmarks could be:
– real programs;
– modified (or scripted) applications;
– kernels – small, key pieces from real programs;
– synthetic benchmarks – not real programs, but code that tries to
match the average frequency of operations and operands of
a large set of programs.
• SPEC (Standard Performance Evaluation Corporation) was
founded in the late 1980s to try to improve the state of
benchmarking and provide a more valid basis for comparing
desktop and server computers.
Benchmarks
• Performance best determined by running a real application
– Use programs typical of expected workload
– Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused
• SPEC (System Performance Evaluation Cooperative)
– companies have agreed on a set of real programs and inputs
– valuable indicator of performance (and compiler technology)
– can still be abused
SPEC Benchmark Suites
Benchmark Games
• An embarrassed Intel Corp. acknowledged Friday that a bug in a software
program known as a compiler had led the company to overstate the speed of
its microprocessor chips on an industry benchmark by 10 percent. However,
industry analysts said the coding error…was a sad commentary on a common
industry practice of “cheating” on standardized performance tests…The
error was pointed out to Intel two days ago by a competitor,
Motorola…came in a test known as SPECint92…Intel acknowledged that it
had “optimized” its compiler to improve its test scores. The company had
also said that it did not like the practice but felt compelled to make the
optimizations because its competitors were doing the same thing…At the
heart of Intel’s problem is the practice of “tuning” compiler programs to
recognize certain computing problems in the test and then substituting
special handwritten pieces of code…
SPEC ‘89
• Compiler “enhancements” and performance
[Figure: bar chart of SPEC performance ratio (0 to 800) for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, comparing the compiler with the enhanced compiler.]
SPEC CPU2000
Amdahl's Law
Execution Time After Improvement =
Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)
• Example:
"Suppose a program runs in 100 seconds on a machine, with
multiply responsible for 80 seconds of this time. How much do we
have to improve the speed of multiplication if we want the program to
run 4 times faster?"
How about making it 5 times faster?
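Both questions can be answered by solving the Amdahl's Law formula above for the amount of improvement. A minimal sketch; the function name and signature are illustrative.

```python
def required_improvement(total, affected, speedup):
    """Factor by which the affected part must speed up so that the
    whole program runs `speedup` times faster.

    Returns None when the target cannot be reached.
    """
    unaffected = total - affected
    target = total / speedup
    budget = target - unaffected  # time left for the affected part
    if budget <= 0:
        return None
    return affected / budget


print(required_improvement(100, 80, 4))
print(required_improvement(100, 80, 5))
```

For a 4x speedup the program must finish in 25 seconds, leaving 5 seconds for multiplication, so multiply must become 16 times faster. A 5x speedup would require the whole program to run in the 20 seconds already taken by the unaffected part, so no improvement to multiplication alone can achieve it.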
where k1 is a coefficient and CPU time_i is the CPU time for the
i-th integer program of a total of 12 programs in the workload.
Similarly for floating point performance, CFP2000 is given as:
CFP2000 = k2 × Π_{i=1}^{14} (1 / FP Execution time_i)
Performance Example (part 1/5)
Performance Example (part 2/5)
• Machine with No Floating Point Hardware - MNFP does
not support real number instructions, but all its
instructions are identical to non-real number instructions
of MFP. Each MNFP instruction (including integer
instructions) takes 2 clock cycles. Thus, MNFP is
identical to MFP without real number instructions.
• Any real number operation (in a program) has to be
emulated by an appropriate software subroutine (i.e., the
compiler has to insert an appropriate sequence of
integer instructions for each real number operation). The
number of integer instructions needed to implement each
real number operation is as follows:
– real number multiply needs 30 integer instructions
– real number add needs 20 integer instructions
– real number divide needs 50 integer instructions
Performance Example (part 3/5)
Consider Program P with the following mix of operations:
– real number multiply 10%
– real number add 15%
– real number divide 5%
– other instructions 70%
a. Find the MIPS rating for both machines.
CPI_MFP = 0.1×6 + 0.15×4 + 0.05×20 + 0.7×2
= 3.6 clocks/instr
CPI_MNFP = 2
MIPS_MFP rating = clock rate / (CPI × 10^6) = (1000 × 10^6) / (3.6 × 10^6) = 277.8
MIPS_MNFP rating = (1000 × 10^6) / (2 × 10^6) = 500
According to the MIPS rating, MNFP is better than MFP!?
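A sketch that recomputes part (a) from the instruction mix. The MFP cycle counts (6, 4, 20, and 2) are the ones given in the MFP machine description; the dictionary keys are illustrative labels.

```python
CLOCK_RATE = 1000e6  # Hz, as used in the MIPS formula above

# Instruction mix of Program P and cycles per instruction on MFP.
MIX = {"fmul": 0.10, "fadd": 0.15, "fdiv": 0.05, "other": 0.70}
MFP_CYCLES = {"fmul": 6, "fadd": 4, "fdiv": 20, "other": 2}

cpi_mfp = sum(MIX[k] * MFP_CYCLES[k] for k in MIX)
cpi_mnfp = 2.0  # every MNFP instruction takes 2 cycles

mips_mfp = CLOCK_RATE / (cpi_mfp * 1e6)
mips_mnfp = CLOCK_RATE / (cpi_mnfp * 1e6)

print(f"MFP:  CPI = {cpi_mfp:.1f}, MIPS = {mips_mfp:.1f}")
print(f"MNFP: CPI = {cpi_mnfp:.1f}, MIPS = {mips_mnfp:.1f}")
```

The MIPS rating favors MNFP only because it ignores how many more instructions MNFP must execute to emulate each real number operation.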
Performance Example (part 4/5)
b. If Program P on MFP needs 300,000,000 instructions, find
time to execute this program on each machine.
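A sketch of one way to work part (b), assuming a 1000 MHz clock on both machines and that each "other" instruction maps to exactly one MNFP instruction. The emulation costs (30, 20, 50) and the MFP cycle counts come from the machine descriptions; the names are illustrative.

```python
CLOCK_RATE = 1000e6    # Hz, assumed the same for both machines
N_MFP = 300_000_000    # instructions Program P executes on MFP

MIX = {"fmul": 0.10, "fadd": 0.15, "fdiv": 0.05, "other": 0.70}
MFP_CYCLES = {"fmul": 6, "fadd": 4, "fdiv": 20, "other": 2}
# Integer instructions MNFP needs per original instruction.
EMULATION = {"fmul": 30, "fadd": 20, "fdiv": 50, "other": 1}
MNFP_CPI = 2

# MFP: instruction count x CPI / clock rate.
time_mfp = N_MFP * sum(MIX[k] * MFP_CYCLES[k] for k in MIX) / CLOCK_RATE

# MNFP: the instruction count grows because of emulation.
n_mnfp = N_MFP * sum(MIX[k] * EMULATION[k] for k in MIX)
time_mnfp = n_mnfp * MNFP_CPI / CLOCK_RATE

print(f"MFP:  {time_mfp:.2f} s")
print(f"MNFP: {n_mnfp:.3g} instructions, {time_mnfp:.2f} s")
```

Under these assumptions MFP finishes in about 1.08 seconds while MNFP needs about 5.52 seconds, the opposite of what the MIPS ratings suggest.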
Performance Example (part 5/5)
c. Calculate MFLOPS for both computers.
MFLOPS = Number of floating point operations in a program / (Execution time × 10^6)
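A sketch applying the MFLOPS formula to Program P. It recomputes the execution times from the example's figures, assuming a 1000 MHz clock on both machines and one MNFP instruction per "other" instruction. The program performs the same number of floating point operations on either machine; the names are illustrative.

```python
CLOCK_RATE = 1000e6
N_MFP = 300_000_000

MIX = {"fmul": 0.10, "fadd": 0.15, "fdiv": 0.05, "other": 0.70}
MFP_CYCLES = {"fmul": 6, "fadd": 4, "fdiv": 20, "other": 2}
EMULATION = {"fmul": 30, "fadd": 20, "fdiv": 50, "other": 1}
MNFP_CPI = 2

# Execution times on each machine.
time_mfp = N_MFP * sum(MIX[k] * MFP_CYCLES[k] for k in MIX) / CLOCK_RATE
time_mnfp = N_MFP * sum(MIX[k] * EMULATION[k] for k in MIX) * MNFP_CPI / CLOCK_RATE

# Floating point operations: multiply, add, and divide only.
fp_ops = N_MFP * (MIX["fmul"] + MIX["fadd"] + MIX["fdiv"])

mflops_mfp = fp_ops / (time_mfp * 1e6)
mflops_mnfp = fp_ops / (time_mnfp * 1e6)

print(f"MFLOPS: MFP = {mflops_mfp:.1f}, MNFP = {mflops_mnfp:.1f}")
```

Unlike the MIPS rating, MFLOPS ranks the two machines in the same order as their execution times.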
• Machine With Floating Point Hardware - MFP
– real number multiply instruction requires 6 clock cycles
– real number add instruction requires 4 clock cycles
– real number divide instruction requires 20 clock cycles
– any other instruction (including integer instructions)
requires 2 clock cycles