CS 294-73
Software Engineering for
Scientific Computing
[email protected]
[email protected]
Lecture 1: Introduction
Grading
• 5-6 homework assignments, adding up to 60% of the grade.
• The final project is worth 40% of the grade.
- Project will be a scientific program, preferably in an area related to
your research interests or thesis topic.
- Novel architectures and technologies are not encouraged (projects will
need to run on a standard Mac OS X or Linux workstation)
- For the final project only, you will self-organize into teams to develop
your proposal. Undergraduates may need additional help developing a
project proposal.
Hardware/Software Requirements
• Laptop or desktop computer on which you have root permission
• Mac OS X or Linux operating system
- Cygwin or MinGW on Windows *might* work, but we have limited
experience there to help you.
• Installed software (this is your IDE)
- gcc or clang
- GNU Make
- gdb or lldb
- ssh
- VisIt
- Doxygen
- emacs
- LaTeX
Homework and Project submission
• Submission will be done via the class source code repository (git).
• At midnight on the deadline date, the homework submission
directory is made read-only.
• We will be setting up times for you to get accounts.
What we are not going to teach you in class
• Navigating and using Unix
• Unix commands you will want to know
- ssh
- scp
- tar
- gzip/gunzip
- ls
- mkdir
- chmod
- ln
• Emphasis in class lectures will be explaining what is really going on,
not syntax issues. We will rely heavily on online reference material,
available at the class website.
• Students with no prior experience with C/C++ are strongly urged to
take CS9F.
What is Scientific Computing ?
We will be mainly interested in scientific computing as it arises in
simulation.
The scientific computing ecosystem:
• A science or engineering problem that requires simulation.
• Models – must be mathematically well posed.
• Discretizations – replacing continuous variables by a finite number
of discrete variables.
• Software – correctness, performance.
• Data – inputs, outputs.
• Hardware.
• People.
What will you learn from taking this course ?
The skills and tools to allow you to understand (and perform) good
software design for scientific computing.
• Programming: expressiveness, performance, scalability to large
software systems (otherwise, you could do just fine in matlab).
• Data structures and algorithms as they arise in scientific
applications.
• Tools for organizing a large software development effort (build tools,
source code control).
• Debugging and data analysis tools.
Why C++ ?
(Compare to Matlab, Python, ...).
• Strong typing + compilation. Catch a large class of errors at compile
time, rather than at run time.
• Strong scoping rules. Encapsulation, modularity.
• Abstraction, orthogonalization. Use of libraries and layered
design.
C++, Java, and some dialects of Fortran support these techniques well,
to varying degrees. The trick is doing so without sacrificing
performance. In this course, we will use C++.
- Strongly typed language with a mature compiler technology.
- Powerful abstraction mechanisms.
Who should take this course ?
Students who don’t have the skills listed above, and expect to need them
soon.
• Expect to take CS 267.
• Building or adding to a large software system as part of your research.
• Interested in scientific computing.
• Interested in high-performance computing.
• Prior to this semester, EECS graduate students were not permitted to take
this course.
A Cartoon View of Hardware
What is a performance model ?
• A “faithful cartoon” of how source code gets executed.
• Languages / compilers / run-time systems that allow you to
implement based on that cartoon.
• Tools to measure performance in terms of the cartoon, and close
the feedback loop.
The Von Neumann Architecture / Model
[Diagram: a CPU with registers connected to memory, which holds both
instructions and data, plus attached devices.]
• Data and instructions are equivalent in terms of the memory.
• Instructions are executed in a sequential order implied by the
source code.
• Really easy cartoon to understand and program to.
• The extent to which the cartoon is an illusion can have substantial
impact on the performance of your program.
Memory Hierarchy
• Take advantage of the principle of locality to:
- Present as much memory as is available in the cheapest technology
- Provide access at the speed offered by the fastest technology
[Diagram: a multicore processor with per-core caches and a shared second-level
cache (SRAM), connected through a memory controller to main memory
(DRAM/FLASH/PCM), secondary storage (disk / cloud storage), and tertiary
storage (tape).]
Latency (ns): ~1 ~5-10 ~100 ~10^7 ~10^10
Size (bytes): ~10^6 ~10^9 ~10^12 ~10^15
The Principle of Locality
• The Principle of Locality:
- Programs access a relatively small portion of the address space at any
instant of time.
• Two Different Types of Locality:
- Temporal Locality (Locality in Time): If an item is referenced, it will tend
to be referenced again soon (e.g., loops, reuse)
- so, keep a copy of recently read memory in cache.
- Spatial Locality (Locality in Space): If an item is referenced, items whose
addresses are close by tend to be referenced soon
(e.g., straightline code, array access)
- Guess where the next memory reference is going to be based on
your access history.
• Processors have relatively high bandwidth to memory, but also very
high latency. Cache is a way to hide latency.
- Lots of pins, but talking over the pins is slow.
- DRAM is (relatively) cheap and slow. Banking gives you more bandwidth.
Programs with locality cache well ...
[Figure: memory address (one dot per access) plotted against time, showing
bands of temporal locality, bands of spatial locality, and a region of bad
locality behavior.]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory.
IBM Systems Journal 10(3): 168-192 (1971)
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
- Hit Rate: the fraction of memory access found in the upper level
- Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block
Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: Time to replace a block in the upper level +
Time to deliver the block to the processor
• Hit Time << Miss Penalty
[Diagram: the processor exchanges data with the upper level memory (holding
block X), which in turn exchanges blocks with the lower level memory (holding
block Y).]
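A standard way to combine these quantities is the average memory access time (AMAT):
$$\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$$
Because Hit Time << Miss Penalty, even a small miss rate can dominate the average.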
Consequences for programming
• A common way to exploit spatial locality is to try to get stride-1
memory access (see the loop-ordering sketch at the end of this list)
- Cache fetches a cache line worth of memory on each cache miss
- Cache line can be 32-512 bytes (or more)
• Each cache miss causes an access to the next deeper memory
hierarchy
- Processor usually will sit idle while this is happening
- When that cache line arrives, some existing data in your cache will be
evicted, which can cause a subsequent memory access to miss as well. When
this happens with high frequency it is called cache thrashing.
• Caches are designed to work best for programs where data access
has lots of simple locality.
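As a concrete illustration (a minimal sketch, not from the course materials; the names are illustrative), the two loops below do the same work on a row-major array, but only the first walks memory with stride 1:

```cpp
#include <cstddef>
#include <vector>

// Stride-1 traversal: the inner loop visits consecutive addresses, so every
// cache line fetched on a miss is fully used before it is evicted.
void scaleRowOrder(std::vector<double>& a, std::size_t nx, std::size_t ny)
{
  for (std::size_t j = 0; j < ny; ++j)
    for (std::size_t i = 0; i < nx; ++i)
      a[j*nx + i] *= 2.0;            // a is stored row-major: a[j*nx + i]
}

// Same arithmetic, but the inner loop strides by nx doubles; for large nx
// each cache line contributes only one useful element before being evicted,
// the kind of access pattern that leads to thrashing.
void scaleColumnOrder(std::vector<double>& a, std::size_t nx, std::size_t ny)
{
  for (std::size_t i = 0; i < nx; ++i)
    for (std::size_t j = 0; j < ny; ++j)
      a[j*nx + i] *= 2.0;
}
```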
But processor architectures keep changing
• SIMD (vector) instructions: a(i) = b(i) + c(i), i = 1, …, 4 is as fast as
a single a0 = b0 + c0
• Non-uniform memory access
• Many processing elements with varying performance
I will have someone give a guest lecture on this during the
semester. Otherwise, not our problem (but it will be in CS 267).
Take a peek at your own computer
• Most Linux machines
- >cat /proc/cpuinfo
• Mac
- >sysctl -a hw
Seven Motifs of Scientific Computing
Simulation in the physical sciences and engineering is carried out using
various combinations of the following core algorithms.
• Structured grids
• Unstructured grids
• Dense linear algebra
• Sparse linear algebra
• Fast Fourier transforms
• Particles
• Monte Carlo (We won’t be doing this one)
Each of these has its own distinctive combination of computation and
data access.
There is a corresponding list for data (with significant overlap).
Seven Motifs of Scientific Computing
• Blue Waters usage patterns, in terms of motifs (dwarf/algorithm
classification by node hours): Structured Grid 26%, FFT 16%, Sparse Matrix 16%,
Dense Matrix 14%, N-Body 13%, I/O 10%, Monte Carlo 4%, Unstructured Grid 1%.
(Figure 2.3-1: Colella's seven dwarf classification of recognized applications
run on Blue Waters, by total node hours, in the study period, assuming equal
weighting if an application is using more than one algorithm in Table 10.0-1
in Appendix IV.)
A “Big-O, Little-o” Notation
$f = \Theta(g)$ if and only if $f = O(g)$ and $g = O(f)$
Structured Grids
Used to represent continuously varying
quantities in space in terms of values on a
regular (usually rectangular) lattice.
$$\phi = \phi(x) \;\rightarrow\; \phi_i \approx \phi(ih), \qquad \phi : B \rightarrow \mathbb{R}, \quad B \subset \mathbb{Z}^D$$
If B is a rectangle, data is stored in a contiguous block of memory:
$$B = [1, \dots, N_x] \times [1, \dots, N_y], \qquad \phi_{i,j} = \mathrm{chunk}(i + (j-1)N_x)$$
Typical operations are stencil operations, e.g. to compute finite
difference approximations to derivatives.
$$L(\phi)_{i,j} = \frac{1}{h^2}\left(\phi_{i,j+1} + \phi_{i,j-1} + \phi_{i+1,j} + \phi_{i-1,j} - 4\,\phi_{i,j}\right)$$
Small number of flops per memory access, mixture of unit stride
and non-unit stride.
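A minimal sketch of such a stencil operation (assuming a plain contiguous array with a one-cell boundary layer; illustrative, not the course's library):

```cpp
#include <cstddef>
#include <vector>

// 5-point Laplacian on an nx-by-ny grid stored contiguously, i fastest.
// Interior points only; the boundary layer supplies the missing neighbors.
void laplacian(const std::vector<double>& phi, std::vector<double>& lphi,
               std::size_t nx, std::size_t ny, double h)
{
  auto idx = [nx](std::size_t i, std::size_t j) { return j*nx + i; };
  for (std::size_t j = 1; j + 1 < ny; ++j)
    for (std::size_t i = 1; i + 1 < nx; ++i)
      // (i±1, j) are unit-stride accesses; (i, j±1) are a whole row (nx) away.
      lphi[idx(i,j)] = (phi[idx(i,j+1)] + phi[idx(i,j-1)]
                      + phi[idx(i+1,j)] + phi[idx(i-1,j)]
                      - 4.0*phi[idx(i,j)]) / (h*h);
}
```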
Structured Grids
In practice, things can get much more complicated: for example, if B is a
union of rectangles, represented as a list.
$$\phi = \phi(x) \;\rightarrow\; \phi_i \approx \phi(ih), \qquad \phi : B \rightarrow \mathbb{R}, \quad B \subset \mathbb{Z}^D$$
To apply stencil operations, need to get values from neighboring
rectangles.
$$L(\phi)_{i,j} = \frac{1}{h^2}\left(\phi_{i,j+1} + \phi_{i,j-1} + \phi_{i+1,j} + \phi_{i-1,j} - 4\,\phi_{i,j}\right)$$
Can also have a nested hierarchy of grids, which means that
missing values must be interpolated.
Algorithmic / software issues: sorting, caching addressing
information, minimizing costs of irregular computation.
Unstructured Grids
• Simplest case: triangular / tetrahedral
elements, used to fit complex geometries.
Grid is specified as a collection of nodes,
organized into triangles.
$$N = \{x_n : n = 1, \dots, N_{\mathrm{nodes}}\}$$
$$E = \{(x^e_{n_1}, \dots, x^e_{n_{D+1}}) : e = 1, \dots, N_{\mathrm{elts}}\}$$
• Discrete values of the function to be
represented are defined on nodes of the
grid.
$\phi = \phi(x)$ is approximated by $\phi : N \rightarrow \mathbb{R}$, $\phi_n \approx \phi(x_n)$
• Other access patterns are required to solve PDE problems, e.g. find all of
the nodes that are connected to a node by an element. Algorithmic issues:
sorting, graph traversal.
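A minimal data-structure sketch for the node / element description above (triangles, D = 2; the layout and names are illustrative, not the course framework):

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Node { double x, y; };                // coordinates x_n
using Element = std::array<int, 3>;          // indices of the 3 nodes of a triangle

struct TriangularGrid
{
  std::vector<Node>    nodes;                // N = {x_n : n = 1, ..., N_nodes}
  std::vector<Element> elements;             // E = {(n_1, n_2, n_3) : e = 1, ..., N_elts}
  std::vector<double>  phi;                  // one field value per node
};

// One of the "other access patterns": for every node, list the elements that
// contain it, from which the nodes connected to a node can be found.
std::vector<std::vector<int>> nodeToElements(const TriangularGrid& g)
{
  std::vector<std::vector<int>> adj(g.nodes.size());
  for (std::size_t e = 0; e < g.elements.size(); ++e)
    for (int n : g.elements[e])
      adj[n].push_back(static_cast<int>(e));
  return adj;
}
```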
Dense Linear Algebra
Want to solve system of equations
$$
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,n} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
a_{n,1} & a_{n,2} & a_{n,3} & \cdots & a_{n,n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix}
$$
Dense linear algebra
Gaussian elimination: eliminate the subdiagonal entries one column at a time.
Starting from
$$
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,n} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
a_{n,1} & a_{n,2} & a_{n,3} & \cdots & a_{n,n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix},
$$
the first row reduction
$$
a_{k,l} := a_{k,l} - \frac{a_{k,1}}{a_{1,1}}\, a_{1,l}, \qquad
b_k := b_k - \frac{a_{k,1}}{a_{1,1}}\, b_1
$$
zeroes the first column below the diagonal:
$$
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,n} \\
0       & a_{2,2} & a_{2,3} & \cdots & a_{2,n} \\
0       & a_{3,2} & a_{3,3} & \cdots & a_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
0       & a_{n,2} & a_{n,3} & \cdots & a_{n,n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix}.
$$
The second row reduction
$$
a_{k,l} := a_{k,l} - \frac{a_{k,2}}{a_{2,2}}\, a_{2,l}, \qquad
b_k := b_k - \frac{a_{k,2}}{a_{2,2}}\, b_2
$$
zeroes the second column below the diagonal, and so on:
$$
\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,n} \\
0       & a_{2,2} & a_{2,3} & \cdots & a_{2,n} \\
0       & 0       & a_{3,3} & \cdots & a_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
0       & 0       & a_{n,3} & \cdots & a_{n,n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix}.
$$
The p-th row reduction costs $2(n-p)^2 + O(n)$ flops, so that the total cost is
$$
\sum_{p=1}^{n-1} 2(n-p)^2 + O(n^2) = O(n^3).
$$
Good for performance: unit stride access, and O(n) flops per word of
data accessed. But, if you have to write back to main memory...
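The elimination loop maps directly onto nested loops; here is a minimal sketch (dense row-major storage, no pivoting, illustrative only):

```cpp
#include <cstddef>
#include <vector>

// Forward elimination on an n-by-n matrix a (row-major, a[k*n + l]) and
// right-hand side b.  After step p, column p is zero below the diagonal.
void forwardEliminate(std::vector<double>& a, std::vector<double>& b, std::size_t n)
{
  for (std::size_t p = 0; p < n; ++p)
    for (std::size_t k = p + 1; k < n; ++k)
    {
      double m = a[k*n + p] / a[p*n + p];   // multiplier a_{k,p} / a_{p,p}
      for (std::size_t l = p; l < n; ++l)   // ~2(n-p) flops per row => O(n^3) total
        a[k*n + l] -= m * a[p*n + l];
      b[k] -= m * b[p];
    }
}
```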
Sparse Linear Algebra
$$
A = \begin{pmatrix}
1.5 & 0    & 0   & 0   & 0   & 0   & 0   & 0   \\
0   & 2.3  & 0   & 1.4 & 0   & 0   & 0   & 0   \\
0   & 0    & 3.7 & 0   & 0   & 0   & 0   & 0   \\
0   & -1.6 & 0   & 2.3 & 9.9 & 0   & 0   & 0   \\
0   & 0    & 0   & 0   & 5.8 & 0   & 0   & 0   \\
0   & 0    & 0   & 0   & 0   & 7.4 & 0   & 0   \\
0   & 0    & 1.9 & 0   & 0   & 0   & 4.9 & 0   \\
0   & 0    & 0   & 0   & 0   & 0   & 0   & 3.6
\end{pmatrix}
$$
Want to store only non-zeros, so use compressed-sparse-row storage (CSR) format.
JA  (column indices): 1 2 4 3 2 4 5 5 6 3 7 8
StA (nonzero values): 1.5 2.3 1.4 3.7 -1.6 2.3 9.9 5.8 7.4 1.9 4.9 3.6
IA  (row pointers):   1 2 4 5 8 9 10 12 13
Sparse Linear Algebra
• Matrix multiplication: indirect addressing.
Not a good fit for cache hierarchies.
$$(Ax)_k = \sum_{j = IA_k}^{IA_{k+1} - 1} (StA)_j \, x_{JA_j}, \qquad k = 1, \dots, 8$$
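A minimal sketch of that sum in C++ (assuming IA, JA, and StA already use 0-based indexing; the slide's arrays are 1-based):

```cpp
#include <cstddef>
#include <vector>

// y = A*x for a matrix stored in compressed-sparse-row form.
std::vector<double> csrMatVec(const std::vector<int>&    IA,   // row pointers, size n+1
                              const std::vector<int>&    JA,   // column index of each nonzero
                              const std::vector<double>& StA,  // value of each nonzero
                              const std::vector<double>& x)
{
  const std::size_t n = IA.empty() ? 0 : IA.size() - 1;
  std::vector<double> y(n, 0.0);
  for (std::size_t k = 0; k < n; ++k)
    for (int j = IA[k]; j < IA[k + 1]; ++j)
      y[k] += StA[j] * x[JA[j]];   // x[JA[j]] is the indirect (gather) access
  return y;
}
```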
• Gaussian elimination: fills in any column
below a nonzero entry all the way to the
diagonal. Can attempt to minimize this by
reordering the variables.
• Iterative methods for sparse matrices are based on applying the matrix to
the vector repeatedly. This avoids memory blowup from Gaussian
elimination, but need to have a good approximate inverse to work well.
Fast Fourier Transform (Cooley and Tukey, 1965)
We also have
$$F_k^P(x) = F_{k+P}^P(x).$$
So the number of flops to compute $F^N(x)$ is 2N, given that you have
$F^{N/2}(E(x))$ and $F^{N/2}(O(x))$.
Fast Fourier Transform
If $N = 2^M$, we can apply this to $F^{N/2}(E(x))$ and $F^{N/2}(O(x))$:
the number of flops to compute these smaller Fourier transforms is also
$2 \times 2 \times (N/2) = 2N$, given that you have the N/4 transforms. We can
continue this process until we are computing $2^{M-1}$ sets of $F^2$, each of
which costs O(1) flops. So the total number of flops is $O(MN) = O(N \log N)$.
The algorithm is recursive, and the data access pattern is complicated.
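A minimal recursive sketch of this idea (assumes N is a power of two and uses one common sign convention; illustrative only, not an optimized FFT such as FFTW):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

std::vector<std::complex<double>> fft(const std::vector<std::complex<double>>& x)
{
  const std::size_t N = x.size();
  if (N == 1) return x;                        // F^1 is the identity

  std::vector<std::complex<double>> even(N/2), odd(N/2);
  for (std::size_t j = 0; j < N/2; ++j) { even[j] = x[2*j]; odd[j] = x[2*j + 1]; }

  auto E = fft(even);                          // F^{N/2}(E(x))
  auto O = fft(odd);                           // F^{N/2}(O(x))

  std::vector<std::complex<double>> X(N);
  const double pi = 3.141592653589793;
  for (std::size_t k = 0; k < N/2; ++k)
  {
    auto w = std::polar(1.0, -2.0*pi*double(k)/double(N)) * O[k];
    X[k]       = E[k] + w;                     // combine: ~2N flops per level,
    X[k + N/2] = E[k] - w;                     // using F^{N/2}_{k+N/2} = F^{N/2}_k
  }
  return X;
}
```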
Particle Methods
Collection of particles, either representing physical particles, or a
discretization of a continuous field.
$$\{x_k, v_k, w_k\}_{k=1}^{N}$$
$$\frac{dx_k}{dt} = v_k, \qquad \frac{dv_k}{dt} = F(x_k)$$
$$F(x) = \sum_{k'} w_{k'}\, (\nabla \phi)(x - x_{k'})$$
To evaluate the force for a single particle requires N evaluations
of $\nabla \phi$, leading to an O(N^2) cost per time step.
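A minimal sketch of the O(N^2) direct evaluation (the inverse-square kernel below is an assumed stand-in for the per-pair term $w_{k'} (\nabla\phi)(x - x_{k'})$; illustrative only):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Assumed kernel: w * dx / |dx|^3, i.e. (up to sign) the gradient of a 1/|x| potential.
Vec3 forceKernel(const Vec3& dx, double w)
{
  double r2 = dx.x*dx.x + dx.y*dx.y + dx.z*dx.z;
  double s  = w / (r2 * std::sqrt(r2));
  return Vec3{s*dx.x, s*dx.y, s*dx.z};
}

// N kernel evaluations per particle -> O(N^2) work per time step.
std::vector<Vec3> directForces(const std::vector<Vec3>& pos, const std::vector<double>& w)
{
  const std::size_t N = pos.size();
  std::vector<Vec3> F(N, Vec3{0.0, 0.0, 0.0});
  for (std::size_t k = 0; k < N; ++k)
    for (std::size_t kp = 0; kp < N; ++kp)
    {
      if (kp == k) continue;
      Vec3 dx{pos[k].x - pos[kp].x, pos[k].y - pos[kp].y, pos[k].z - pos[kp].z};
      Vec3 f = forceKernel(dx, w[kp]);
      F[k].x += f.x; F[k].y += f.y; F[k].z += f.z;
    }
  return F;
}
```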
Particle Methods
To reduce the cost, need to localize the force calculation. For typical
force laws arising in classical physics, there are two cases.
• Short-range forces (e.g. Lennard-Jones potential).
$$\phi(x) = \frac{C_1}{|x|^{6}} - \frac{C_2}{|x|^{12}}$$
$$\phi(x) \approx 0 \quad \text{if } |x| \text{ exceeds the cutoff distance}$$
The forces fall off sufficiently rapidly that the approximation
introduces acceptably small errors for practical values of the cutoff
distance.
Particle Methods
• Coulomb / Newtonian potentials
$$\phi(x) = \frac{1}{|x|} \ \text{in 3D}, \qquad \phi(x) = \log(|x|) \ \text{in 2D}$$
cannot be localized by cutoffs without
an unacceptable loss of accuracy.
However, the far field of a given
particle, while not small, is smooth,
with rapidly decaying derivatives. Can
take advantage of that in various
ways. In both cases, it is necessary to
sort the particles in space, and
organize the calculation around which
particles are nearby / far away.
Options: “Buy or Build?”
• “Buy”: use software developed and maintained by someone else.
• “Build”: write your own.
• Some problems are sufficiently well-characterized that there are
bulletproof software packages freely available: LAPACK (dense
linear algebra), FFTW. You still need to understand their properties and
how to integrate them into your application.
• “Build” – but what do you use as a starting point ?
- Program everything from the ground up.
- Use a framework that has some of the foundational components built
and optimized.
• Unlike LAPACK and FFTW, frameworks typically are not “black
boxes” – you will need to interact more deeply with them.
Tradeoffs
• Models – How faithfully does the model reproduce reality, versus
the cost of computing with that model ? Well-posedness, especially
stability to small perturbations in inputs (because numerical
approximations generate them).
• Discretizations – replacing continuous variables by a finite number
of discrete variables. Numerical stability – the discrete system must
be resilient to arbitrary small perturbations to the inputs. Robustness
to off-design use.
• Software – correctness, performance. How difficult is this to
implement / modify, especially for high performance ? Correctness /
performance debugging.
• Data – inputs, outputs. How much data does this generate ? If it is
large, how do you look at it ?
The art of designing simulation software is navigating the tradeoffs
among these considerations to get the best scientific throughput.
Roofline Model
• An example of a cartoon for performance.
[Figure: Empirical Roofline Graph (Results.cori1.nersc.gov.05/Run.002).
GFLOPs/sec versus FLOPs/Byte on log-log axes, with a peak of 844.5 GFLOPs/sec
(Maximum) and bandwidth ceilings for the L1, L2, and L3 caches and DRAM.]
$$L(\phi)_{i,j} = \frac{1}{h^2}\left(\phi_{i,j+1} + \phi_{i,j-1} + \phi_{i+1,j} + \phi_{i-1,j} - 4\,\phi_{i,j}\right)$$
6 floating point operations (flops), 16 bytes of data read / written.
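Spelling out the arithmetic, the stencil's arithmetic intensity is
$$\frac{6 \ \text{flops}}{16 \ \text{bytes}} \approx 0.375 \ \text{flops/byte},$$
and the attainable performance in the roofline model is roughly
$\min(\text{peak flop rate},\ 0.375 \times \text{memory bandwidth})$, which places
this kernel in the bandwidth-limited region of the graph.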
Roofline Model
• An example of a cartoon for performance.
[Figure: Single Socket Roofline for NERSC's Cori (Haswell partition, Cray XC40).
GFLOP/s versus FLOP/Byte, with ceilings for the GFLOP/s spec (multiply/add and
add), the L1, L2, and L3 caches, and DRAM, and measured points for GMG/Cheby
(fused), GMG/Cheby, AMG/SpMV, FFT (2M), FFT (1K), and DGEMM.]
What will you learn from taking this course ?
The skills and tools to allow you to understand (and perform) good
software design for scientific computing.
• Programming: expressiveness, performance, scalability to large
software systems (otherwise, you could do just fine in matlab).
• Data structures and algorithms as they arise in scientific
applications.
• Tools for organizing a large software development effort (build tools,
source code control).
• Debugging and data analysis tools.