Lecture1_Computer Abstractions and Technology v2

The document discusses the evolution of computer organization, highlighting key milestones such as the Hollerith Tabulating Machine, the Turing Machine, and the Von Neumann model. It explains the impact of Moore's Law on technology scaling and the classification of computers into categories like personal, server, and supercomputers. Additionally, it covers performance metrics, instruction count, and the factors influencing CPU performance.

Computer Organization (計算機組織)

Computer Abstractions and Technology

Kun-Chih (Jimmy) Chen 陳坤志
[email protected]
Institute of Electronics,
National Yang Ming Chiao Tung University

NYCU EE / IEE
The Computer Revolution

❖ Progress in computer technology
  ❖ Underpinned by Moore's Law
❖ Makes novel applications feasible
  ❖ Computers in automobiles
  ❖ Cell phones
  ❖ Human genome project
  ❖ World Wide Web
  ❖ Search engines
❖ Computers are pervasive

P2
Hollerith Tabulating Machine – CTR (1890)

❖ It opened the world's eyes to the very idea of data processing.
❖ Beginning with the 1890 US census
  ❖ Reduced the data-collection time from 144.5 hours to 72.5 hours
  ❖ Reduced the tabulating time from 55.5 hours to 5.5 hours

P3
From CTR to IBM (1896 → 1924)

❖ Herman Hollerith was a founder of the Computing-Tabulating-Recording
  Company (CTR)
  ❖ Its main business was the census
❖ Thomas J. Watson rebranded CTR as International Business Machines
  Corporation (IBM)

P4
Turing Machine

❖ The idea of a universal computational device was first described by
  Alan Turing in 1937
  ❖ All computation could be performed by a special kind of machine

P5
Von Neumann Model

❖ Around 1944–1945, John von Neumann proposed that


❖ since program and data are logically the same, programs should also be
stored in the memory of a computer.

Von Neumann with the first Institute computer


P6
Four Subsystems
❖ Computers built on the von Neumann model divide the computer
hardware into four subsystems
❖ Memory: Circuit Theory, Electronics, VLSI Design
❖ Arithmetic logic unit (ALU): Logic Design, Computer Architecture,
  Digital System Design, VLSI Design
❖ Control unit: Logic Design, Computer Architecture, Digital System Design, …
❖ Input/output

P7
ENIAC (1945)

❖ ENIAC (Electronic Numerical Integrator and Computer) was the first


programmable, electronic, general-purpose digital computer

P8
First Transistor (1947)

❖ In celebration of ENIAC's 50th anniversary, the machine was
  reimplemented using CMOS technology in 1995. The room-sized
  computer could now fit in the palm of your hand!

P9
Google’s first server (1999)

P10
Moore’s Law and Technology Scaling
❖ "… the performance of an IC, including the number of components on
  it, doubles every 18–24 months with the chip price …"
  – Gordon Moore, 1965

P11
Moore’s Law in Action
❖ Example: Intel Core i7 microprocessor evolution
  ➢ Six-core Core i7 (Gulftown), 2010: 32nm technology, 1,170 M
    transistors, die size 240 mm², transistor density 4.875 M/mm²
  ➢ 10-core Core i7 (Broadwell-E), 2016: 14nm technology, 3,200 M
    transistors, die size 246 mm², transistor density 13.008 M/mm²

Source: Intel
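The density figures above follow directly from transistor count divided by die area; a quick check in Python, using only the numbers from this slide:

```python
# Transistor density = transistor count / die area, for the two
# Core i7 data points above.
gulftown = {"transistors_m": 1170, "die_mm2": 240}      # 32nm, 2010
broadwell_e = {"transistors_m": 3200, "die_mm2": 246}   # 14nm, 2016

def density(chip):
    """Millions of transistors per mm^2."""
    return chip["transistors_m"] / chip["die_mm2"]

print(round(density(gulftown), 3))     # 4.875
print(round(density(broadwell_e), 3))  # 13.008
```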
P12
Reference Size

❖ Cell size (diameter): 10–100 µm
❖ Erythrocyte size (diameter): 6–9 µm
❖ Protein size (diameter): 1–20 nm
❖ DNA molecule size (diameter): 2 nm
P13
The IC at 60 (1958–2018)

P14
Classes of Computers

❖ Personal computers
❖ General purpose, variety of software
❖ Subject to cost/performance tradeoff

❖ Server computers
❖ Network based
❖ High capacity, performance, reliability
❖ Range from small servers to building sized

❖ Supercomputers
❖ High-end scientific and engineering calculations
❖ Highest capability but represent a small fraction of the overall computer
market

❖ Embedded computers
❖ Hidden as components of systems
❖ Stringent power/performance/cost constraints
P15
Understanding Performance

❖ Algorithm
❖ Determines number of operations executed

❖ Programming language, compiler, architecture


❖ Determine number of machine instructions executed per operation

❖ Processor and memory system


❖ Determine how fast instructions are executed

❖ I/O system (including OS)


❖ Determines how fast I/O operations are executed

P16
Eight Great Ideas

❖ Design for Moore’s Law

❖ Use abstraction to simplify design

❖ Make the common case fast

❖ Performance via parallelism

❖ Performance via pipelining

❖ Performance via prediction

❖ Hierarchy of memories

❖ Dependability via redundancy

P17
Below Your Program
❖ Application software
❖ Written in high-level language

❖ System software
❖ Compiler: translates HLL code to machine code
❖ Operating System: service code
➢ Handling input/output
➢ Managing memory and storage
➢ Scheduling tasks & sharing resources

❖ Hardware
❖ Processor, memory, I/O controllers

P18
Levels of Program Code
❖ High-level language
❖ Level of abstraction closer to problem
domain
❖ Provides for productivity and portability

❖ Assembly language
❖ Textual representation of instructions

❖ Hardware representation
❖ Binary digits (bits)
❖ Encoded instructions and data

P19
Semiconductor Technology

❖ Silicon: semiconductor
❖ Add materials to transform properties:
❖ Conductors
❖ Insulators
❖ Switch

P20
Technology Trends

❖ Electronics technology
continues to evolve
❖ Increased capacity and
performance
❖ Reduced cost

(Figure: DRAM capacity growth over time)

Year   Technology                     Relative performance/cost
1951   Vacuum tube                                  1
1965   Transistor                                  35
1975   Integrated circuit (IC)                    900
1995   Very large scale IC (VLSI)           2,400,000
2013   Ultra large scale IC           250,000,000,000

P21
VLSI IC Technology

                               2001    2005    2010    2016
Line width (nm)                 130      80      45      22
Clock (GHz)                     1.7     5.2    11.5    28.8
DRAM cost (microcents/bit)      7.7     1.9    0.34   0.042
MPU cost (microcents/transistor) 97      24    4.31    0.54
Supply voltage (V)              1.2     1.0     0.8     0.6
Wiring levels                     7       9      10      10

(Figure: cost per transistor falls while chip density rises)


P22
Manufacturing ICs

❖ Yield: proportion of working dies per wafer

P23
Intel Core i7 Wafer

❖ 300mm wafer, 280 chips, 32nm technology


❖ Each chip is 20.7 x 10.5 mm

P24
Integrated Circuit Cost

Cost per die = Cost per wafer / (Dies per wafer × Yield)

Dies per wafer ≈ Wafer area / Die area

Yield = 1 / (1 + (Defects per area × Die area / 2))²

❖ Nonlinear relation to area and defect rate


❖ Wafer cost and area are fixed
❖ Defect rate determined by manufacturing process
❖ Die area determined by architecture and circuit design

P25
Performance

❖ Purchasing perspective: given a collection of machines, which has the
  ❖ best performance? least cost?
  ❖ best performance/cost?

❖ Design perspective: faced with design options, which has the
  ❖ best performance improvement? least cost?
  ❖ best performance improvement/cost?

❖ Both require
  ❖ a basis for comparison
  ❖ a metric for evaluation

❖ Goal: understand cost and


performance implications of
architectural choices
P26
Performance Measures

❖ Typical performance metrics
  ❖ Response time (latency): time between start and completion
  ❖ Throughput (bandwidth): rate of work done per unit time

❖ Speedup of X relative to Y
  ❖ Execution timeY / Execution timeX
  ❖ "B is n times faster than A" means
    ➢ exec_timeA / exec_timeB = rateB / rateA = n

❖ Execution time
  ❖ Wall-clock time (or response time, or elapsed time): includes all
    system overheads
  ❖ CPU time: only computation time

P27
Measuring Performance

❖ Time is the measure of computer performance

❖ Elapsed time (Wall-clock time) = program execution + I/O + wait


❖ important to user

❖ Execution time = user time + system time (but OS self measurement


may be inaccurate)

❖ CPU performance = user time on unloaded system


❖ important to architect

P28
Real Performance Measurement

❖ Benchmark suites
  ❖ Attempts at running programs that are much simpler than a real
    application have led to performance pitfalls

❖ Performance is the result of executing a workload on a configuration

❖ Workload = program + input

❖ Configuration = CPU + cache + memory + I/O + OS + compiler


optimizations

❖ compiler optimizations can make a huge difference!

P29
Improve Performance by

❖ Changing the
  ❖ algorithm (e.g., bubble sort vs. quicksort)
  ❖ data structures (e.g., structure vs. pointer)
  ❖ programming language (e.g., C vs. Matlab)
  ❖ compiler
  ❖ compiler optimization flags
  ❖ OS parameters (e.g., Windows vs. Linux)

❖ Improving locality of memory or I/O accesses

❖ Overlapping I/O (e.g., unidirectional port vs. bidirectional port)

❖ On multiprocessors, you can improve performance by avoiding cache
  coherency problems (e.g., false sharing) and synchronization problems

P30
CPU Clocking
❖ Operation of digital hardware governed by a constant-rate clock

❖ Clock period: duration of a clock cycle
  ❖ e.g., 250 ps = 0.25 ns = 250×10⁻¹² s

❖ Clock frequency (rate): cycles per second
  ❖ e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz

(Figure: clock waveform; data transfer and computation occur within each
clock cycle, and state is updated at the cycle boundary)

P31
CPU Time

CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
❖ Performance improved by
❖ Reducing number of clock cycles
❖ Increasing clock rate
❖ Hardware designer must often trade off clock rate against cycle count

P32
CPU Time Example
❖ Computer A: 2GHz clock, 10s CPU time
❖ Designing Computer B
❖ Aim for 6s CPU time
❖ Can do faster clock, but causes 1.2 × clock cyclesA
❖ How fast must Computer B clock be?

Clock RateB = Clock CyclesB / CPU TimeB = (1.2 × Clock CyclesA) / 6s

Clock CyclesA = CPU TimeA × Clock RateA
             = 10s × 2GHz = 20×10⁹

Clock RateB = (1.2 × 20×10⁹) / 6s = (24×10⁹) / 6s = 4GHz
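The derivation above can be checked numerically; a minimal sketch:

```python
# Computer A: 2GHz clock, 10s CPU time. Computer B targets 6s but
# its faster clock costs 1.2x the clock cycles of A.
clock_rate_a = 2e9
cpu_time_a = 10.0
target_time_b = 6.0
cycle_inflation = 1.2

clock_cycles_a = cpu_time_a * clock_rate_a          # 20e9 cycles
clock_cycles_b = cycle_inflation * clock_cycles_a   # 24e9 cycles
clock_rate_b = clock_cycles_b / target_time_b
print(round(clock_rate_b / 1e9, 6))                 # 4.0 (GHz)
```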
P33
Instruction Count and CPI

Clock Cycles = Instruction Count × Cycles per Instruction (CPI)

CPU Time = Instruction Count × CPI × Clock Cycle Time
         = (Instruction Count × CPI) / Clock Rate
❖ Instruction Count for a program
❖ Determined by program, ISA and compiler
❖ Average cycles per instruction
❖ Determined by CPU hardware
❖ If different instructions have different CPI
➢ Average CPI affected by instruction mix

P34
What is Instruction Count?

❖ How many C operations?

    for(i=0; i<10; i++)
        a[i] = b[i] * c[i];

❖ How many assembly instructions?

          sub  $r1,$r2,$r3
    Loop: beq  $r9,$r0,End
          add  $r8,$r8,$r10
          addi $r9,$r9,-1
          j    Loop
    End:

  ❖ The loop executes 10 times => 41 instructions

❖ This is the Dynamic Instruction Count

P35
CPI Example
❖ Computer A: Cycle Time = 250ps, CPI = 2.0
❖ Computer B: Cycle Time = 500ps, CPI = 1.2
❖ Same ISA
❖ Which is faster, and by how much?

CPU TimeA = Instruction Count × CPIA × Cycle TimeA
          = I × 2.0 × 250ps = I × 500ps            A is faster…

CPU TimeB = Instruction Count × CPIB × Cycle TimeB
          = I × 1.2 × 500ps = I × 600ps

CPU TimeB / CPU TimeA = (I × 600ps) / (I × 500ps) = 1.2   …by this much
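Because both machines run the same instruction count I, the comparison can be checked with any concrete I (it cancels in the ratio); a sketch with I = 10⁹:

```python
I = 1e9                       # same instruction count on both machines
time_a = I * 2.0 * 250e-12    # CPI_A = 2.0, cycle time 250ps
time_b = I * 1.2 * 500e-12    # CPI_B = 1.2, cycle time 500ps
print(round(time_a, 12), round(time_b, 12))  # 0.5 0.6 -> A is faster
print(round(time_b / time_a, 3))             # 1.2
```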

P36
CPI in More Detail
❖ If different instruction classes take different numbers of cycles

Clock Cycles = Σ (over i = 1..n) CPIᵢ × Instruction Countᵢ

❖ Weighted average CPI

CPI = Clock Cycles / Instruction Count
    = Σ (over i = 1..n) CPIᵢ × (Instruction Countᵢ / Instruction Count)

where Instruction Countᵢ / Instruction Count is the relative frequency
of instruction class i

P37
CPI Example
❖ Alternative compiled code sequences using instructions in classes A, B,
C

Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1

❖ Sequence 1: IC = 5
  ❖ Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  ❖ Avg. CPI = 10/5 = 2.0

❖ Sequence 2: IC = 6
  ❖ Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  ❖ Avg. CPI = 9/6 = 1.5
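The weighted-average CPI formula from the previous slide reproduces both results; a minimal sketch using the table's numbers:

```python
cpi = {"A": 1, "B": 2, "C": 3}       # cycles per instruction class
seq1 = {"A": 2, "B": 1, "C": 2}      # instruction counts, sequence 1
seq2 = {"A": 4, "B": 1, "C": 1}      # instruction counts, sequence 2

def clock_cycles(seq):
    return sum(cpi[c] * n for c, n in seq.items())

def avg_cpi(seq):
    # Weighted average: total cycles / total instruction count
    return clock_cycles(seq) / sum(seq.values())

print(clock_cycles(seq1), avg_cpi(seq1))  # 10 2.0
print(clock_cycles(seq2), avg_cpi(seq2))  # 9 1.5
```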

P38
Performance Summary

CPU execution time for program (designer's view)
  = Clock Cycles for program × Clock Cycle Time

❖ Substituting for clock cycles:

CPU execution time for program (user's view)
  = Instruction Count × CPI × Clock Cycle Time

CPU Time = (Instructions / Program) × (Cycles / Instruction)
           × (Seconds / Cycle)

P39
Performance Summary

The BIG Picture

CPU Time = (Instructions / Program) × (Clock cycles / Instruction)
           × (Seconds / Clock cycle)

❖ Performance depends on
❖ Algorithm: affects IC, possibly CPI
❖ Programming language: affects IC, CPI
❖ Compiler: affects IC, CPI
❖ Instruction set architecture: affects IC, CPI, Tc (i.e., clock cycle time)

P40
MIPS as a Performance Measure

❖ MIPS: Million Instructions Per Second

MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)

P41
Pitfall: MIPS as a Performance Metric
❖ MIPS: Millions of Instructions Per Second
❖ Doesn’t account for
➢ Differences in ISAs between computers
➢ Differences in complexity between instructions

MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)

➢ CPI varies between programs on a given CPU

P42
Example

❖ Consider two compilers and three instruction classes, A, B, and C,
  with CPIs of 1, 2, and 3, respectively. Suppose we measure the code
  for the same program from the two compilers and obtain the following
  data:

Code from     Instruction counts (in billions = 10⁹)
                  A     B     C
Compiler 1        5     1     1
Compiler 2       10     1     1

❖ Assume that the machine's clock rate is 4 GHz. Which code sequence
  executes faster according to the MIPS rate? According to execution
  time?

P43
Example
Code from     Instruction counts (in billions)
                  A     B     C
Compiler 1        5     1     1
Compiler 2       10     1     1
CPI               1     2     3

❖ CPU clock cycles1 = (5×1 + 1×2 + 1×3)×10⁹ = 10×10⁹
❖ CPU clock cycles2 = (10×1 + 1×2 + 1×3)×10⁹ = 15×10⁹
❖ Execution time1 = (10×10⁹) / (4×10⁹) = 2.5 sec
❖ Execution time2 = (15×10⁹) / (4×10⁹) = 3.75 sec
❖ MIPS = IC / (Execution time × 10⁶)
  ❖ MIPS1 = [(5+1+1)×10⁹] / (2.5×10⁶) = 2800
  ❖ MIPS2 = [(10+1+1)×10⁹] / (3.75×10⁶) = 3200
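The numbers above can be reproduced in a few lines, which also makes the pitfall visible: compiler 2 wins on MIPS yet loses on execution time:

```python
cpi = (1, 2, 3)             # CPIs of classes A, B, C
ic1 = (5e9, 1e9, 1e9)       # compiler 1 instruction counts
ic2 = (10e9, 1e9, 1e9)      # compiler 2 instruction counts
clock_rate = 4e9            # 4GHz

def exec_time(ic):
    cycles = sum(c * n for c, n in zip(cpi, ic))
    return cycles / clock_rate

def mips(ic):
    return sum(ic) / (exec_time(ic) * 1e6)

print(exec_time(ic1), mips(ic1))  # 2.5 2800.0
print(exec_time(ic2), mips(ic2))  # 3.75 3200.0
```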

P44
Amdahl’s Law: A method to quantify
performance speedup

Gene Amdahl
Computer Pioneer

9/8/2024
P45
Amdahl’s Law

❖ Amdahl’s law gives us a quick way to find the speedup from some
enhancement, which depends on two factors:
❖ Fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement
➢ How much fraction of the computation time can be enhanced?
➢ The fraction is always less than 1
❖ How much faster the task would run if the enhanced mode were used for
the entire program
➢ Speedup = (execution time without enhancement) /
            (execution time with enhancement) = T_without / T_with

P46
Example

❖ Suppose we are considering an enhancement to a web server. The
  enhanced CPU is 10 times faster on computation but the same speed on
  I/O. Suppose also that 60% of the time is spent waiting on I/O.

Answer:
Fraction_enhanced = 0.4
Speedup_enhanced = 10

Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 ≈ 1.56
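Plugging the same numbers into Amdahl's law written as a function:

```python
def speedup_overall(fraction_enhanced, speedup_enhanced):
    # Amdahl's law: the unenhanced fraction runs at the old speed,
    # the enhanced fraction is divided by its speedup.
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

s = speedup_overall(0.4, 10)   # 40% of time is computation, sped up 10x
print(round(s, 2))             # 1.56
```

Even an infinite computation speedup would only give 1/0.6 ≈ 1.67×, since the I/O fraction is untouched.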

P47
Power Trends

❖ In CMOS IC technology

Power = Capacitive load × Voltage² × Frequency

(Figure: supply voltage scaled from 5V to 1V while frequency grew ~1000×,
so power grew ~40×)

P48
Reducing Power
❖ Suppose a new CPU has
❖ 85% of capacitive load of old CPU
❖ 15% voltage and 15% frequency reduction

Pnew / Pold = [(Cold × 0.85) × (Vold × 0.85)² × (Fold × 0.85)]
              / (Cold × Vold² × Fold)
            = 0.85⁴ ≈ 0.52
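The 0.52 ratio is just 0.85 raised to the fourth power: one factor from capacitance, two from the squared voltage, and one from frequency. A one-line check:

```python
scale = 0.85
# P ~ C * V^2 * F: capacitance x0.85, voltage x0.85 (squared), freq x0.85
power_ratio = scale * scale**2 * scale   # = 0.85^4
print(round(power_ratio, 2))             # 0.52
```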

❖ The power wall


❖ We can’t reduce voltage further
❖ We can’t remove more heat

❖ How else can we improve performance?

P49
Power and Energy

❖ Problem: Get power in, get power out

❖ Thermal Design Power (TDP)


❖ Characterizes sustained power consumption
❖ Lower than peak power, higher than average power consumption
❖ Used as target for power supply and cooling system
➢ Two kinds of thermal management in the modern processors

❖ Clock rate can be reduced dynamically to limit power consumption

❖ Energy per task is often a better measurement

P50
Case Study: Profile of Intel i7-8850H

P51
Power

❖ The Intel 80386 consumed ~2 W

❖ A 3.3 GHz Intel Core i7 consumes 130 W

❖ Heat must be dissipated from a 1.5 × 1.5 cm chip

❖ This is about the limit of what can be cooled by air

P52
Power vs. Thermal Issue

❖ Distributing the power, removing the heat, and preventing hot spots
  have become increasingly difficult challenges.
❖ Techniques for reducing power
  ❖ Do nothing well
  ❖ Dynamic voltage-frequency scaling (DVFS)
  ❖ Low-power states for DRAM and disks
  ❖ Overclocking, turning off cores

Leakage current increases ~5% per 10°C

P53
Case Study: Intel® Turbo Boost Technology 2.0

❖ Features
  ❖ Under the Thermal Design Power (TDP), maximize the frequency state
    depending on the workload and:
    ➢ Number of active cores
    ➢ Estimated current consumption
    ➢ Estimated power consumption
    ➢ Processor temperature

❖ Commercial products
  ❖ Intel Core i7-2xxx series
  ❖ Intel Core i5-2xxx series
  ❖ Intel Xeon E3-12xx series

❖ Near-zero power for inactive cores


P54
Single Processor Performance
(Figure: single-processor performance over time; RISC-driven growth,
followed by the move to multiprocessors as growth slowed)

P55
Multiprocessors

❖ Multicore microprocessors
❖ More than one processor per chip

❖ Requires explicitly parallel programming


❖ Compare with instruction level parallelism
➢ Hardware executes multiple instructions at once
➢ Hidden from the programmer
❖ Hard to do
➢ Programming for performance
➢ Load balancing
➢ Optimizing communication and synchronization

P56
Why Do I Want to Know These?

(Figure: the computer architect works at the intersection of technology,
interfaces (ISA, API, I/O channels, links, IR, registers), applications,
historical background and trends, machine organization, and
measurement & evaluation)

P57
In Fact, Architecture Design Is an Iterative
Process

-- searching the space of possible designs
-- at all levels of computer systems

(Figure: design loop: new concepts are created, their cost & performance
are estimated, and candidates are sorted into good ideas and mediocre
ideas; historical background and an understanding of trends help the
selection process)

P58