
CS 294-73 



Software Engineering for
Scientific Computing


pcolella@berkeley.edu
pcolella@lbl.gov

Lecture 1: Introduction



Grading

•  5-6 homework assignments, adding up to 60% of the grade.
•  The final project is worth 40% of the grade.
  -  The project will be a scientific program, preferably in an area related to your research interests or thesis topic.
  -  Novel architectures and technologies are not encouraged (projects will need to run on a standard Mac OS X or Linux workstation).
  -  For the final project only, you will self-organize into teams to develop your proposal. Undergraduates may need additional help developing a project proposal.

08/29/2019 CS294-73 - Lecture 1 2


Hardware/Software Requirements

•  Laptop or desktop computer on which you have root permission
•  Mac OS X or Linux operating system
  -  Cygwin or MinGW on Windows *might* work, but we have limited experience there to help you.
•  Installed software (this is your IDE)
  -  gcc or clang
  -  GNU Make
  -  gdb or lldb
  -  ssh
  -  VisIt
  -  Doxygen
  -  emacs
  -  LaTeX


Homework and Project submission

•  Submission will be done via the class source code repository (git).
•  At midnight on the deadline date, the homework submission directory is made read-only.
•  We will be setting up times for you to get accounts.


What we are not going to teach you in class

•  Navigating and using Unix
•  Unix commands you will want to know
  -  ssh
  -  scp
  -  tar
  -  gzip/gunzip
  -  ls
  -  mkdir
  -  chmod
  -  ln
•  Emphasis in class lectures will be on explaining what is really going on, not syntax issues. We will rely heavily on online reference material, available at the class website.
•  Students with no prior experience with C/C++ are strongly urged to take CS9F.
What is Scientific Computing?

We will be mainly interested in scientific computing as it arises in simulation.
The scientific computing ecosystem:
•  A science or engineering problem that requires simulation.
•  Models – must be mathematically well posed.
•  Discretizations – replacing continuous variables by a finite number of discrete variables.
•  Software – correctness, performance.
•  Data – inputs, outputs.
•  Hardware.
•  People.


What will you learn from taking this course?

The skills and tools to allow you to understand (and perform) good software design for scientific computing.
•  Programming: expressiveness, performance, scalability to large software systems (otherwise, you could do just fine in MATLAB).
•  Data structures and algorithms as they arise in scientific applications.
•  Tools for organizing a large software development effort (build tools, source code control).
•  Debugging and data analysis tools.


Why C++?

(Compare to Matlab, Python, ...)
•  Strong typing + compilation. Catch a large class of errors at compile time, rather than at run time.
•  Strong scoping rules. Encapsulation, modularity.
•  Abstraction, orthogonalization. Use of libraries and layered design.
C++, Java, and some dialects of Fortran support these techniques to varying degrees. The trick is doing so without sacrificing performance. In this course, we will use C++.
  -  Strongly typed language with mature compiler technology.
  -  Powerful abstraction mechanisms.


Who should take this course?

Students who don't have the skills listed above, and expect to need them soon.
•  Expect to take CS 267.
•  Building or adding to a large software system as part of your research.
•  Interested in scientific computing.
•  Interested in high-performance computing.
•  Prior to this semester, EECS graduate students were not permitted to take this course.


A Cartoon View of Hardware

What is a performance model?
•  A "faithful cartoon" of how source code gets executed.
•  Languages / compilers / run-time systems that allow you to implement based on that cartoon.
•  Tools to measure performance in terms of the cartoon, and close the feedback loop.


The Von Neumann Architecture / Model

[Diagram: a CPU (with registers) exchanging instructions or data with Memory; Devices attached to the CPU.]

•  Data and instructions are equivalent in terms of the memory.
•  Instructions are executed in the sequential order implied by the source code.
•  A really easy cartoon to understand and program to.
•  The extent to which the cartoon is an illusion can have substantial impact on the performance of your program.


Memory Hierarchy

•  Take advantage of the principle of locality to:
  -  Present as much memory as is available in the cheapest technology
  -  Provide access at the speed offered by the fastest technology

[Diagram: a multicore processor, each core with its own cache, sharing a second-level cache (SRAM, O(10^6) bytes), backed by main memory (DRAM/FLASH/PCM), secondary storage (disk/FLASH/PCM), and tertiary storage (tape/cloud storage).]

Level:          core cache   shared cache   main memory   secondary   tertiary
Latency (ns):   ~1           ~5-10          ~100          ~10^7       ~10^10
Size (bytes):                ~10^6          ~10^9         ~10^12      ~10^15


The Principle of Locality
•  The Principle of Locality:
  -  Programs access a relatively small portion of the address space at any instant of time.
•  Two Different Types of Locality:
  -  Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    -  so, keep a copy of recently read memory in cache.
  -  Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
    -  so, guess where the next memory reference is going to be, based on your access history.
•  Processors have relatively lots of bandwidth to memory, but also very high latency. Cache is a way to hide latency.
  -  Lots of pins, but talking over the pins is slow.
  -  DRAM is (relatively) cheap and slow. Banking gives you more bandwidth.


Programs with locality cache well ...

[Figure: memory address (one dot per access) plotted against time, showing bands of temporal locality, bands of spatial locality, and a region of bad locality behavior. From Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
Memory Hierarchy: Terminology

•  Hit: data appears in some block in the upper level (example: Block X)
  -  Hit Rate: the fraction of memory accesses found in the upper level
  -  Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
•  Miss: data needs to be retrieved from a block in the lower level (example: Block Y)
  -  Miss Rate = 1 - (Hit Rate)
  -  Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
•  Hit Time << Miss Penalty

[Diagram: the processor exchanges Block X with the upper-level memory, which exchanges Block Y with the lower-level memory.]


Consequences for programming

•  A common way to exploit spatial locality is to try to get stride-1 memory access
  -  The cache fetches a cache line worth of memory on each cache miss
  -  A cache line can be 32-512 bytes (or more)
•  Each cache miss causes an access to the next deeper level of the memory hierarchy
  -  The processor usually sits idle while this is happening
  -  When that cache line arrives, some existing data in your cache is ejected, which can cause a subsequent memory access to miss as well. When this happens with high frequency it is called cache thrashing.
•  Caches are designed to work best for programs where data access has lots of simple locality.
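As a concrete illustration of stride-1 versus strided access, here is a minimal C++ sketch (not from the lecture): both functions compute the same sum over a row-major array, but only the first walks memory with unit stride.

```cpp
#include <vector>
#include <cstddef>

// Sum an nrows-by-ncols matrix stored row-major in a flat vector.
// Stride-1 version: the inner loop walks consecutive addresses, so each
// cache-line fill is fully used before the next miss.
double sumRowMajor(const std::vector<double>& a, std::size_t nrows, std::size_t ncols) {
  double s = 0.0;
  for (std::size_t i = 0; i < nrows; ++i)
    for (std::size_t j = 0; j < ncols; ++j)
      s += a[i*ncols + j];        // addresses i*ncols, i*ncols+1, ...
  return s;
}

// Stride-ncols version: successive accesses are ncols*8 bytes apart, so for
// large ncols every access can touch a different cache line.
double sumColMajor(const std::vector<double>& a, std::size_t nrows, std::size_t ncols) {
  double s = 0.0;
  for (std::size_t j = 0; j < ncols; ++j)
    for (std::size_t i = 0; i < nrows; ++i)
      s += a[i*ncols + j];
  return s;
}
```

On arrays much larger than cache, the column-order version typically runs several times slower, even though both loops do identical arithmetic.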


But processor architectures keep changing

•  SIMD (vector) instructions: a(i) = b(i) + c(i), i = 1, ..., 4 is as fast as a0 = b0 + c0
•  Non-uniform memory access
•  Many processing elements with varying performance

I will have someone give a guest lecture on this during the semester. Otherwise, not our problem (but it will be in CS 267).


Take a peek at your own computer

•  Most Linux machines
  -  > cat /proc/cpuinfo
•  Mac
  -  > sysctl -a hw


Seven Motifs of Scientific Computing
Simulation in the physical sciences and engineering is done using various combinations of the following core algorithms.
•  Structured grids
•  Unstructured grids
•  Dense linear algebra
•  Sparse linear algebra
•  Fast Fourier transforms
•  Particles
•  Monte Carlo (we won't be doing this one)

Each of these has its own distinctive combination of computation and data access.
There is a corresponding list for data (with significant overlap).
Seven Motifs of Scientific Computing

•  Blue Waters usage patterns, in terms of motifs (by node hours):
  -  Structured Grid: 26%
  -  FFT: 16%
  -  Sparse Matrix: 16%
  -  Dense Matrix: 14%
  -  N-Body: 13%
  -  I/O: 10%
  -  Monte Carlo: 4%
  -  Unstructured Grid: 1%

Figure 2.3-1: Colella's seven dwarf classification of recognized applications run on Blue Waters (by total node hours) in the study period, assuming equal weighting if an application is using more than one algorithm in Table 10.0-1 in Appendix IV.
A "Big-O, Little-o" Notation

f = Θ(g) if f = O(g) and g = O(f).
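A small worked example of the definition:

```latex
f(N) = 3N^2 + N \le 4N^2 \ (N \ge 1) \Rightarrow f = O(N^2);
\qquad
N^2 \le f(N) \Rightarrow N^2 = O(f);
\qquad\text{hence } f = \Theta(N^2).
```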


Structured Grids

Used to represent continuously varying quantities in space in terms of values on a regular (usually rectangular) lattice:
φ = φ(x) → φ_i ≈ φ(ih),  φ : B → R,  B ⊂ Z^D.

If B is a rectangle, data is stored in a contiguous block of memory:
B = [1, ..., Nx] × [1, ..., Ny],  φ_{i,j} = chunk(i + (j-1) Nx).

Typical operations are stencil operations, e.g. to compute finite difference approximations to derivatives:
L(φ)_{i,j} = (1/h^2)(φ_{i,j+1} + φ_{i,j-1} + φ_{i+1,j} + φ_{i-1,j} - 4 φ_{i,j}).

A small number of flops per memory access, with a mixture of unit-stride and non-unit-stride access.
Structured Grids

In practice, things can get much more complicated: for example, if B is a union of rectangles, represented as a list.

To apply stencil operations such as
L(φ)_{i,j} = (1/h^2)(φ_{i,j+1} + φ_{i,j-1} + φ_{i+1,j} + φ_{i-1,j} - 4 φ_{i,j}),
you need to get values from neighboring rectangles.

Can also have a nested hierarchy of grids, which means that missing values must be interpolated.
Algorithmic / software issues: sorting, caching addressing information, minimizing costs of irregular computation.
Unstructured Grids

•  Simplest case: triangular / tetrahedral elements, used to fit complex geometries. The grid is specified as a collection of nodes, organized into triangles:
N = {x_n : n = 1, ..., N_nodes},
E = {(x_{n^e_1}, ..., x_{n^e_{D+1}}) : e = 1, ..., N_elts}.
•  Discrete values of the function to be represented are defined on the nodes of the grid:
φ = φ(x) is approximated by φ : N → R, φ_n ≈ φ(x_n).
•  Other access patterns are required to solve PDE problems, e.g. find all of the nodes that are connected to a node by an element. Algorithmic issues: sorting, graph traversal.
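One such access pattern, finding the elements incident to each node, can be sketched as follows (an illustrative helper, not the course's data structure):

```cpp
#include <vector>
#include <array>

// Elements given as triples of node indices (triangles). Build the inverse
// map node -> list of incident elements, a common query on unstructured
// grids (e.g. to find all nodes connected to a node by an element).
std::vector<std::vector<int>>
nodeToElements(const std::vector<std::array<int,3>>& elts, int nNodes) {
  std::vector<std::vector<int>> n2e(nNodes);
  for (int e = 0; e < (int)elts.size(); ++e)
    for (int n : elts[e])
      n2e[n].push_back(e);   // element e touches node n
  return n2e;
}
```

Unlike the structured case, the data layout gives no locality for free: neighboring nodes in the mesh can be far apart in memory, which is why node ordering (sorting) matters.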


Dense Linear Algebra

Want to solve a system of equations A x = b:

⎛ a_{1,1} a_{1,2} a_{1,3} ··· a_{1,n} ⎞ ⎛ x_1 ⎞   ⎛ b_1 ⎞
⎜ a_{2,1} a_{2,2} a_{2,3} ··· a_{2,n} ⎟ ⎜ x_2 ⎟   ⎜ b_2 ⎟
⎜ a_{3,1} a_{3,2} a_{3,3} ··· a_{3,n} ⎟ ⎜ x_3 ⎟ = ⎜ b_3 ⎟
⎜   ...     ...     ...   ···   ...   ⎟ ⎜ ... ⎟   ⎜ ... ⎟
⎝ a_{n,1} a_{n,2} a_{n,3} ··· a_{n,n} ⎠ ⎝ x_n ⎠   ⎝ b_n ⎠


Dense Linear Algebra

Gaussian elimination: the first reduction step zeroes column 1 below the diagonal,
a_{k,l} := a_{k,l} - (a_{k,1}/a_{1,1}) a_{1,l},  b_k := b_k - (a_{k,1}/a_{1,1}) b_1,  k = 2, ..., n;
the second step uses a_{2,2} to zero column 2 below the diagonal,
a_{k,l} := a_{k,l} - (a_{k,2}/a_{2,2}) a_{2,l},  b_k := b_k - (a_{k,2}/a_{2,2}) b_2,  k = 3, ..., n;
and so on, until the system is upper triangular.

The p-th row reduction costs 2(n-p)^2 + O(n) flops, so the total cost is
Σ_{p=1}^{n-1} 2(n-p)^2 + O(n^2) = O(n^3).

Good for performance: unit-stride access, and O(n) flops per word of data accessed. But, if you have to write back to main memory...
Sparse Linear Algebra

    ⎛ 1.5   0    0    0    0    0    0    0  ⎞
    ⎜  0   2.3   0   1.4   0    0    0    0  ⎟
    ⎜  0    0   3.7   0    0    0    0    0  ⎟
A = ⎜  0  -1.6   0   2.3  9.9   0    0    0  ⎟
    ⎜  0    0    0    0   5.8   0    0    0  ⎟
    ⎜  0    0    0    0    0   7.4   0    0  ⎟
    ⎜  0    0   1.9   0    0    0   4.9   0  ⎟
    ⎝  0    0    0    0    0    0    0   3.6 ⎠

Want to store only the nonzeros, so use the compressed-sparse-row (CSR) storage format:

JA   1   2   4   3    2   4   5   5   6   3   7   8
StA 1.5 2.3 1.4 3.7 -1.6 2.3 9.9 5.8 7.4 1.9 4.9 3.6
IA   1   2   4   5   8   9  10  12  13


Sparse Linear Algebra

•  Matrix-vector multiplication uses indirect addressing:
(Ax)_k = Σ_{j = IA_k}^{IA_{k+1} - 1} StA_j x_{JA_j},  k = 1, ..., 8.
Not a good fit for cache hierarchies.
•  Gaussian elimination fills in any column below a nonzero entry all the way to the diagonal. Can attempt to minimize this by reordering the variables.
•  Iterative methods for sparse matrices are based on applying the matrix to a vector repeatedly. This avoids the memory blowup from Gaussian elimination, but needs a good approximate inverse to work well.
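The CSR matrix-vector product above can be sketched as follows (0-based indexing, whereas the slide's arrays are 1-based):

```cpp
#include <vector>

// Compressed-sparse-row matrix-vector product y = A x. ia has nrows+1
// entries; row k's nonzeros live in sta[ia[k] .. ia[k+1]-1], with their
// column indices in ja at the same positions.
std::vector<double> csrMatVec(const std::vector<int>& ia,
                              const std::vector<int>& ja,
                              const std::vector<double>& sta,
                              const std::vector<double>& x) {
  const int nrows = (int)ia.size() - 1;
  std::vector<double> y(nrows, 0.0);
  for (int k = 0; k < nrows; ++k)
    for (int j = ia[k]; j < ia[k+1]; ++j)
      y[k] += sta[j] * x[ja[j]];   // x[ja[j]]: indirect, possibly non-local
  return y;
}
```

The gather x[ja[j]] is the indirect addressing the slide warns about: the access pattern into x depends on the sparsity structure, not on loop order.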


Fast Fourier Transform (Cooley and Tukey, 1965)

We also have
F^P_k(x) = F^P_{k+P}(x).
So the number of flops to compute F^N(x) is 2N, given that you have
F^{N/2}(E(x)) and F^{N/2}(O(x)).




Fast Fourier Transform

If N = 2^M, we can apply this to F^{N/2}(E(x)) and F^{N/2}(O(x)): the number of flops to compute these smaller Fourier transforms is also 2 × 2 × (N/2) = 2N, given that you have the N/4 transforms. Can continue this process until computing 2^{M-1} sets of F^2, each of which costs O(1) flops. So the total number of flops is O(MN) = O(N log N). The algorithm is recursive, and the data access pattern is complicated.
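A recursive radix-2 sketch of the algorithm, assuming N is a power of two (illustrative only; production codes such as FFTW are far more heavily optimized):

```cpp
#include <complex>
#include <vector>
#include <cmath>

using cd = std::complex<double>;

// Radix-2 Cooley-Tukey FFT, written recursively to mirror the slide's
// splitting into even-indexed (E) and odd-indexed (O) subsequences.
std::vector<cd> fft(const std::vector<cd>& x) {
  const std::size_t n = x.size();
  if (n == 1) return x;
  std::vector<cd> even(n/2), odd(n/2);
  for (std::size_t k = 0; k < n/2; ++k) { even[k] = x[2*k]; odd[k] = x[2*k+1]; }
  std::vector<cd> E = fft(even), O = fft(odd), X(n);
  for (std::size_t k = 0; k < n/2; ++k) {
    cd w = std::polar(1.0, -2.0*M_PI*(double)k/(double)n); // twiddle factor
    X[k]       = E[k] + w*O[k];
    X[k + n/2] = E[k] - w*O[k];   // uses the periodicity F^{N/2}_{k+N/2} = F^{N/2}_k
  }
  return X;
}
```

The combine step does O(N) work per level and there are log2 N levels, giving the O(N log N) count; the recursion also makes the non-contiguous even/odd data access visible.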


Particle Methods

A collection of particles, either representing physical particles, or a discretization of a continuous field:
{x_k, v_k, w_k}_{k=1}^N,
dx_k/dt = v_k,
dv_k/dt = F(x_k),
F(x) = Σ_{k'} w_{k'} (∇φ)(x - x_{k'}).

To evaluate the force for a single particle requires N evaluations of ∇φ, leading to an O(N^2) cost per time step.
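The O(N^2) direct evaluation can be sketched as follows; the softened 1/r^2 kernel is a stand-in assumption for the slide's ∇φ, since the quadratic cost, not the physics, is the point:

```cpp
#include <vector>
#include <array>
#include <cmath>

using Vec3 = std::array<double,3>;

// Direct force evaluation: every particle sums the kernel over all others,
// so the double loop does N*(N-1) kernel evaluations per time step.
std::vector<Vec3> directForces(const std::vector<Vec3>& x,
                               const std::vector<double>& w, double eps) {
  const std::size_t n = x.size();
  std::vector<Vec3> f(n, {0,0,0});
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      if (i == j) continue;
      double dx = x[j][0]-x[i][0], dy = x[j][1]-x[i][1], dz = x[j][2]-x[i][2];
      double r2 = dx*dx + dy*dy + dz*dz + eps*eps;  // eps softens the singularity
      double s = w[j] / (r2 * std::sqrt(r2));       // w_j / r^3: 1/r^2 kernel
      f[i][0] += s*dx; f[i][1] += s*dy; f[i][2] += s*dz;
    }
  return f;
}
```

The two strategies on the next slides, cutoffs and far-field expansions, are both ways of replacing this inner loop over all N particles with a loop over a small neighborhood.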


Particle Methods

To reduce the cost, we need to localize the force calculation. For typical force laws arising in classical physics, there are two cases.
•  Short-range forces (e.g. the Lennard-Jones potential):
φ(x) = C_1/|x|^6 - C_2/|x|^12,
∇φ(x) ≈ 0 if |x| > δ.
The forces fall off sufficiently rapidly that the cutoff approximation introduces acceptably small errors for practical values of the cutoff distance δ.


Particle Methods

•  Coulomb / Newtonian potentials:
φ(x) = 1/|x| in 3D,
φ(x) = log(|x|) in 2D.
These cannot be localized by cutoffs without an unacceptable loss of accuracy. However, the far field of a given particle, while not small, is smooth, with rapidly decaying derivatives. Can take advantage of that in various ways. In both cases, it is necessary to sort the particles in space, and organize the calculation around which particles are nearby / far away.


Options: "Buy or Build?"

•  "Buy": use software developed and maintained by someone else.
•  "Build": write your own.
•  Some problems are sufficiently well-characterized that there are bulletproof software packages freely available: LAPACK (dense linear algebra), FFTW. You still need to understand their properties and how to integrate them into your application.
•  "Build" – but what do you use as a starting point?
  -  Programming everything from the ground up.
  -  Using a framework that has some of the foundational components built and optimized.
•  Unlike LAPACK and FFTW, frameworks typically are not "black boxes" – you will need to interact more deeply with them.


Tradeoffs

•  Models – How faithfully does the model reproduce reality, versus the cost of computing with that model? Well-posedness, especially stability to small perturbations in inputs (because numerical approximations generate them).
•  Discretizations – replacing continuous variables by a finite number of discrete variables. Numerical stability – the discrete system must be resilient to arbitrarily small perturbations to the inputs. Robustness to off-design use.
•  Software – correctness, performance. How difficult is this to implement / modify, especially for high performance? Correctness / performance debugging.
•  Data – inputs, outputs. How much data does this generate? If it is large, how do you look at it?
The art of designing simulation software is navigating the tradeoffs among these considerations to get the best scientific throughput.
Roofline Model

•  An example of a cartoon for performance.

[Figure: Empirical Roofline Graph (Results.cori1.nersc.gov.05/Run.002). GFLOPs/sec vs. FLOPs/Byte on log-log axes, with a horizontal ceiling at 844.5 GFLOPs/sec (maximum) and sloped bandwidth ceilings for L1, L2, L3, and DRAM.]

L(φ)_{i,j} = (1/h^2)(φ_{i,j+1} + φ_{i,j-1} + φ_{i+1,j} + φ_{i-1,j} - 4 φ_{i,j})

6 floating point operations (FLOPs), 16 bytes of data read / written.
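The slide's counts give the stencil's arithmetic intensity, which the roofline model combines with a bandwidth ceiling to bound attainable performance:

```latex
\mathrm{AI} = \frac{6\ \text{FLOPs}}{16\ \text{bytes}} \approx 0.375\ \text{FLOPs/byte},
\qquad
\text{attainable GFLOPs/s} \le \min\bigl(\text{peak},\ \mathrm{AI} \times \mathrm{BW}\bigr).
```

At 0.375 FLOPs/byte the stencil sits far to the left of the plot, under the sloped part of the roof: it is bandwidth-bound, not compute-bound.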


Roofline Model

•  An example of a cartoon for performance.

[Figure: Single Socket Roofline for NERSC's Cori (Haswell partition, Cray XC40). GFLOP/s vs. FLOP/Byte, with compute ceilings (GFLOP/s spec, Multiply/Add and Add) and bandwidth ceilings for L1, L2, L3, and DRAM. Measured kernels: DGEMM near the compute peak; FFT (1K) and FFT (2M) in between; AMG/SpMV, GMG/Cheby, and GMG/Cheby (fused) at low arithmetic intensities.]


What will you learn from taking this course?

The skills and tools to allow you to understand (and perform) good software design for scientific computing.
•  Programming: expressiveness, performance, scalability to large software systems (otherwise, you could do just fine in MATLAB).
•  Data structures and algorithms as they arise in scientific applications.
•  Tools for organizing a large software development effort (build tools, source code control).
•  Debugging and data analysis tools.
