Introduction to Data and Memory Intensive Computing
Gordon Summer Institute & Cyberinfrastructure Summer Institute for Geoscientists 8/8/2011
Robert Sinkovits
Gordon Applications Lead
San Diego Supercomputer Center
Overview
Data Intensive Computing
Memory Intensive Computing
Blurring boundary between data/memory
How Gordon can help
Flash memory
Parallel file systems
vSMP
Allocations and operations
SDSC expertise, XSEDE AUSS
Early success stories
Data intensive computing
Data intensive problems can be characterized by the sizes of the input, output, and intermediate data sets
Can also be classified according to patterns of data access (e.g. random vs. sequential, small vs. large reads/writes)
Performance can be improved through changes to hardware, systems software, file systems, or the user application
Data mining and certain types of visualization applications often require processing large amounts of raw data, but can end up producing fairly small amounts of output. In some cases, the result can be a single number
See Leetaru presentation on Thursday morning
[Figure: relative sizes of large data collections, spanning GB to PB scales: human genomics, particle physics at the Large Hadron Collider, annual email traffic (excluding spam), the World Wide Web, estimated on-line RAM in Google, Wikipedia, personal digital photos, the Internet Archive, 200 of London's traffic cams, the 2004 Walmart transaction DB, a typical oil company, the Merck bio research DB, UPMC Hospitals imaging data, one day of instant messaging in 2002, the MIT Babytalk speech experiment, and the TeraShake earthquake model of the LA basin. Source: Phillip Gibbons, Intel Research Pittsburgh, 2008]
Simulations involving integration of ODEs (e.g. molecular dynamics) or PDEs (e.g. CFD, structural mechanics, weather and climate modeling) may involve modest amounts of input data, but end up generating large amounts of output: 4D data sets proportional to problem size x number of time steps
Many problems in domains such as graph algorithms, de novo sequence assembly, and quantum chemistry require intermediate files that are disproportionately large relative to the size of the input/output files
See Pearce presentation on Tuesday morning
Pfeiffer presentation on Wed morning
Generic compute node / terminology
Core → processor (socket) → node (board)
Peak = nodes × (processors/node) × (cores/processor) × (flops/cycle) × clock speed
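To make the formula concrete, here is a small C calculation. The numbers below are purely illustrative assumptions for the example, not the specifications of Gordon or any other system in this talk:

#include <stdio.h>

int main(void) {
    /* Illustrative values only; not the specifications of any particular system */
    double nodes           = 1024;   /* compute nodes                              */
    double procs_per_node  = 2;      /* processors (sockets) per node              */
    double cores_per_proc  = 8;      /* cores per processor                        */
    double flops_per_cycle = 8;      /* floating-point operations per cycle per core */
    double clock_ghz       = 2.5;    /* clock speed in GHz                         */

    /* Peak = nodes x (processors/node) x (cores/processor) x (flops/cycle) x clock speed */
    double peak_tflops = nodes * procs_per_node * cores_per_proc
                       * flops_per_cycle * clock_ghz / 1000.0;   /* Gflops -> Tflops */

    printf("Theoretical peak: %.1f TFLOPS\n", peak_tflops);
    return 0;
}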
Memory intensive computing
No uniform definition for memory intensive computing, but here we use it to refer to problems that require more shared memory than is available on standard compute hardware
machine            peak (TF)   nodes   mem (TB)   mem/node (GB)
Kraken                1174      9408      147     16
Ranger                 579      3936      123     32
Lonestar4              302      1888       44     24
Gordon (1/1/12)       >200      1024       64     64 (512, 1024, 2048, ...)
Athena                 166      4152       16     4
Trestles               100       324       20     64
Steele                  66       893       28     16-32
Condor Pool             60      1750       27     0.5-32
Lincoln                 48       192        3     16
Blacklight              37       256       32     128 (16384)
Most HPC systems are designed for distributed memory applications. Data structures are decomposed into chunks that are assigned to distinct compute nodes, each with its own local memory
Why not just use a distributed memory model?
Data structures
Many important/interesting problems do not have data structures that map well to distributed memory: graphs, trees, unstructured grids
Programmer effort
In some cases, the burden to develop a distributed memory application (e.g. using MPI) is too great to justify the effort
Efficiency of implementation
Sometimes the communications overhead for a distributed memory implementation is too high and results in poor performance
OpenMP
Thread-based parallelism that employs a fork-join model. Straightforward to use and requires minimal code modification. Parallelism is expressed through pragmas or directives that are ignored unless the appropriate compiler flags are set. Allows for incremental parallelization of code; ideal for loop-level parallelism
#pragma omp parallel for \
        reduction(+: sum) \
        schedule(static, 10)
for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    sum += a[i];
}
!$OMP parallel do
!$OMP& reduction(+: sum)
!$OMP& schedule(static, 10)
do i=1,n
a(i) = b(i) + c(i)
sum = sum + a(i)
enddo
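A brief usage note: these loops only run in parallel when OpenMP is enabled at compile time (e.g. -fopenmp for the GNU compilers or -openmp for the Intel compilers of this era); the thread count is then typically set at run time through the OMP_NUM_THREADS environment variable.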
In my opinion, this is THE best place to learn more about OpenMP
https://computing.llnl.gov/tutorials/openMP/
Memory intensive problem: conformational sampling
Generation of molecular conformations from low-order probability distribution functions (PDFs) makes it possible to calculate thermodynamic quantities that are not accessible from MD
Somani, Killian, and Gilson, J Chem Phys 130 (2009)
Somani and Gilson, J Chem Phys 134 (2011)
Conformational sampling: data structures and memory access
Data structures for N degrees of freedom and M bins:
Singlet PDFs: N blocks of size M
Doublet PDFs: N(N-1)/2 blocks of size M²
Triplet PDFs: N(N-1)(N-2)/6 blocks of size M³
Typically N ~ 50-200, M = 30
Sample rows (pencils) from 2D (3D) arrays, with a different access pattern for each conformation. Convenient to have the entire problem in a single shared memory. For N=200, M=30, required memory ~ 130 GB
[Figure: doublet sampling with N=6, M=5, showing the rows sampled for conformations 1, 2, and 3]
Gilson, Somani, and Sinkovits: Dash/Gordon collaboration
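As a rough check on that estimate, the short C sketch below tallies the singlet, doublet, and triplet PDF storage for N = 200 and M = 30, assuming 4 bytes per histogram bin (the bytes-per-bin value is an assumption made for this example, not something stated above):

#include <stdio.h>

int main(void) {
    double N = 200.0, M = 30.0;            /* degrees of freedom, bins per PDF dimension */
    double bytes_per_bin = 4.0;            /* assumed: one single-precision value per bin */

    double singlet = N * M;                              /* N blocks of size M            */
    double doublet = N * (N - 1) / 2.0 * M * M;          /* N(N-1)/2 blocks of size M^2   */
    double triplet = N * (N - 1) * (N - 2) / 6.0
                   * M * M * M;                          /* N(N-1)(N-2)/6 blocks of size M^3 */

    double total_gb = (singlet + doublet + triplet) * bytes_per_bin
                    / (1024.0 * 1024.0 * 1024.0);        /* bytes -> GB (2^30) */

    printf("Approximate PDF storage: %.0f GB\n", total_gb);
    return 0;
}

With these assumptions the total comes out near 132 GB, consistent with the ~130 GB quoted above.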
Memory intensive problem: subset removal algorithm
Transient objects detected in the night sky by the Large Synoptic Survey Telescope (LSST)
Detections grouped into tracks that may represent partial orbits of asteroids or other near-earth objects
As candidate tracks are constructed, some tracks turn out to be wholly contained within others; a subset removal algorithm is used to detect and delete these tracks
In addition to avoiding duplication, this reduces the computational load in later steps of the pipeline
See Myers presentation on Wed morning
Subset removal data structure
Detections are stored in a red-black tree to minimize access time. Tracks associated with each detection are also stored as trees, resulting in a tree-of-trees data structure.
The subset removal algorithm is most efficient if the entire data structure is stored in shared memory. For realistic problems, the memory footprint is ~ 100 GB
See Myers presentation on Wed morning
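To make the tree-of-trees layout concrete, here is a minimal C sketch using the POSIX tsearch/tfind balanced-tree routines. This is not the MOPS implementation, and the record layout and identifiers below are hypothetical:

#include <stdio.h>
#include <stdlib.h>
#include <search.h>   /* POSIX tsearch/tfind: balanced binary search trees */

/* Hypothetical record: a detection plus the set of tracks that contain it */
typedef struct {
    long  det_id;      /* detection identifier            */
    void *track_root;  /* inner tree of track identifiers */
} detection_t;

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

static int cmp_detection(const void *a, const void *b) {
    const detection_t *x = a, *y = b;
    return (x->det_id > y->det_id) - (x->det_id < y->det_id);
}

int main(void) {
    void *det_root = NULL;                    /* outer tree: all detections */

    /* Insert detection 42 and attach tracks 7 and 9 to it */
    detection_t *d = malloc(sizeof *d);
    d->det_id = 42;
    d->track_root = NULL;
    tsearch(d, &det_root, cmp_detection);     /* outer insert */

    static long t1 = 7, t2 = 9;
    tsearch(&t1, &d->track_root, cmp_long);   /* inner inserts */
    tsearch(&t2, &d->track_root, cmp_long);

    /* Look up detection 42 and test whether track 9 contains it */
    detection_t key = { .det_id = 42 };
    detection_t **hit = tfind(&key, &det_root, cmp_detection);
    long probe = 9;
    printf("track 9 contains detection 42: %s\n",
           (hit && tfind(&probe, &(*hit)->track_root, cmp_long)) ? "yes" : "no");
    return 0;
}

The outer tree is keyed on detection ID, and each detection carries an inner tree of the tracks that contain it.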
Blurring the boundary between data and memory intensive computing
/scratch does not necessarily have to be a hard disk (HDD)
To improve performance, scratch files can be written to flash drives: O(10²) lower latency than HDD
To do even better, files can be written to DRAM: O(10³-10⁵) lower latency than HDD
See Strande presentation on Mon afternoon
and Tatineni presentations on Tuesday morning
How Gordon can help you solve data/memory intensive problems
Flash memory
80 GB local to each compute node (1024 nodes)
4.8 TB served from each I/O node (64 nodes)
Parallel file systems
Lustre file system with 4 PB capacity
100 GB/s aggregate bandwidth into I/O nodes
vSMP
Aggregate memory - 512, 1024, 2048 GB
Allocations and operations
Dedicated long-term access to I/O nodes
Interactive queues for visualization
Software
Advanced User Support Services
See Strande presentation on Monday afternoon
Wilkins-Diehr presentation on Thursday morning
For data intensive applications, the main advantage of flash is the low latency
Performance of the memory subsystem has not kept up with gains in processor speed
As a result, latencies to access data from hard disk are O(10,000,000) cycles
Flash memory fills this gap and provides O(100) lower latency
As new non-volatile memories are developed, they can fill the role of flash
Using flash in the memory hierarchy: parallel streamline visualization
Camp et al., accepted to IEEE Symp. on Large-Scale Data Analysis and Visualization (LDAV 2011)
See Camp presentation on Tuesday morning
Introduction to vSMP
[Diagram: N servers, each running its own OS, aggregated by vSMP into 1 VM running 1 OS]
Virtualization software for aggregating multiple off-the-shelf systems into a single virtual machine, providing improved usability and higher performance
Partitioning: a subset of the physical resource; multiple virtual machines, each running its own app and OS on top of a hypervisor or VMM
Aggregation: concatenation of physical resources; a single virtual machine running one app and OS across multiple hypervisors/VMMs
See Paikowsky presentation on Wed afternoon
A vSMP node is configured from 16 compute nodes and one I/O node
To the user, it logically appears as a single, large SMP node
Overview of a vSMP node
/proc/cpuinfo indicates 128 processors (16 nodes x 8 cores/node = 128)
Top shows 663 GB memory (16 nodes x 48 GB/node = 768 GB)
Difference due to vSMP overhead
Gordon Software
chemistry
adf
amber
gamess
gaussian
gromacs
lammps
namd
nwchem
distributed computing
globus
Hadoop
MapReduce
visualization
idl
NCL
paraview
tecplot
visit
VTK
genomics
abyss
blast
hmmer
soapdenovo
velvet
data mining
IntelligentMiner
RapidMiner
RATTLE
Weka
compilers/languages
gcc, intel, pgi
MATLAB, Octave, R
PGAS (UPC)
DB2, PostgreSQL
libraries
ATLAS
BLACS
fftw
HDF5
Hypre
SPRNG
superLU
* Partial list of software to be installed, open to user requests
From I/O bound to compute bound: breadth-first search
[Chart: MR-BFS serial performance on a graph with 134,217,726 nodes; I/O and non-I/O time in seconds for SSDs vs. HDDs]
Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
Benchmark problem: BFS on a graph containing 134 million nodes
Use of flash drives reduced I/O time by a factor of 6.5. As expected, there was no measurable impact on non-I/O operations
Problem converted from I/O bound to compute bound
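The actual MR-BFS code, with its external-memory sort/scan passes, is not reproduced here; the toy C sketch below (a hypothetical stand-in) only illustrates the access pattern that makes this class of algorithm I/O-bound: each level's frontier is streamed to and from scratch files, so frontier reads and writes inherit the latency of whatever device backs /scratch.

#include <stdio.h>

#define NV 8                              /* toy graph with 8 vertices       */
static const int adj[NV][NV] = {          /* adjacency matrix (undirected)   */
    {0,1,1,0,0,0,0,0}, {1,0,0,1,0,0,0,0}, {1,0,0,1,1,0,0,0}, {0,1,1,0,0,1,0,0},
    {0,0,1,0,0,0,1,0}, {0,0,0,1,0,0,0,1}, {0,0,0,0,1,0,0,1}, {0,0,0,0,0,1,1,0}
};

int main(void) {
    int level[NV];
    for (int v = 0; v < NV; v++) level[v] = -1;

    /* The current frontier lives in a scratch file: a stand-in for /scratch
     * backed by HDD, SSD, or a RAM file system                              */
    FILE *cur = tmpfile();
    int root = 0;
    level[root] = 0;
    fwrite(&root, sizeof root, 1, cur);

    for (int depth = 0; ; depth++) {
        FILE *next = tmpfile();           /* next frontier, also on scratch  */
        long added = 0;
        int u;
        rewind(cur);
        while (fread(&u, sizeof u, 1, cur) == 1) {       /* stream frontier in  */
            for (int v = 0; v < NV; v++)
                if (adj[u][v] && level[v] < 0) {
                    level[v] = depth + 1;
                    fwrite(&v, sizeof v, 1, next);       /* stream frontier out */
                    added++;
                }
        }
        fclose(cur);
        cur = next;
        if (added == 0) break;            /* no unvisited neighbors: done    */
    }
    fclose(cur);

    for (int v = 0; v < NV; v++) printf("vertex %d: level %d\n", v, level[v]);
    return 0;
}

Moving the scratch files from HDD to flash speeds up only the frontier I/O, which is consistent with the benchmark result above: I/O time drops sharply while the in-memory (non-I/O) work is unchanged.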
Using flash to improve serial performance: LIDAR
[Chart: run time in seconds for a 100 GB load (with and without FastParse) and 100 GB count(*) queries (cold and warm) using SSDs vs. HDDs]
Remote sensing technology used to map geographic features with high resolution
Benchmark problem: load 100 GB of data into a single table, then count rows, using a DB2 database instance
Flash drives 1.5x (load) to 2.4x (count) faster than hard disks
Using flash to improve concurrency: LIDAR
[Chart: run time in seconds for 1, 4, and 8 concurrent queries using SSDs vs. HDDs]
Remote sensing technology used to map geographic features with high resolution
Comparison of run times for concurrent LIDAR queries obtained with flash drives (SSD) and hard drives (HDD) using the Alaska Denali-Totschunda data collection
Impact of SSDs was modest for a single query, but significant when executing multiple simultaneous queries
vSMP case study: MOPS (subset removal)
[Chart: relative speed vs. number of cores (1-32) for MOPS subset removal on 79,684,646 tracks, comparing three vSMP configurations (3.5.175.22 dyn, 3.5.175.17 dyn, 3.5.175.17 stat) with a PDAF node; results at higher thread counts to be shown Wednesday]
Total memory usage ~ 100 GB (3 boards)
See Myers presentation on Wed morning
Sets of detections collected using the Large Synoptic Survey Telescope are grouped into tracks representing potential asteroid orbits
Subset removal algorithm used to identify and eliminate those tracks that are wholly contained within other tracks
7.3x speedup on 8 cores
Better performance and scaling (up to 8 threads) than the physical large-memory PDAF node
Summary
Gordon can be used to solve data and memory intensive problems that cannot be handled by even the largest HPC systems
Already having great success stories with Dash, but things will only get better with improved processors, interconnect, flash drives, and vSMP software
SDSC provides much more than cycles - our expertise can help you make the most of Gordon and enable transitions from desktop to supercomputing