This document provides an overview of data and memory intensive computing and how the Gordon supercomputer can help with such problems. It discusses how data intensive problems involve large input/output datasets while memory intensive problems require more memory than a single node. It also outlines how flash memory, parallel file systems, virtual shared memory (vSMP), and software tools on Gordon can improve performance for these types of applications.


Introduction to Data and Memory Intensive Computing

Gordon Summer Institute & Cyberinfrastructure Summer Institute for Geoscientists
8/8/2011
Robert Sinkovits, Gordon Applications Lead, San Diego Supercomputer Center

Overview
- Data intensive computing
- Memory intensive computing
- Blurring boundary between data/memory
- How Gordon can help
  - Flash memory
  - Parallel file systems
  - vSMP
  - Allocations and operations
  - SDSC expertise, XSEDE AUSS
- Early success stories

Data intensive computing



- Data intensive problems can be characterized by the sizes of the input, output, and intermediate data sets
- They can also be classified according to patterns of data access (e.g., random vs. sequential, small vs. large reads/writes)
- Performance can be improved through changes to hardware, systems software, file systems, or the user application

Data mining and certain types of visualization applications often require processing large amounts of raw data, but can end up producing fairly small amounts of output. In some cases, the result can be a single number.

See Leetaru presentation on Thursday morning

[Figure: examples of data set sizes and growth rates, ranging from hundreds of gigabytes to thousands of petabytes: human genomics, the Large Hadron Collider, annual email traffic (no spam), the World Wide Web, estimated on-line RAM in Google, Wikipedia, personal digital photos, the Internet Archive, 200 of London's traffic cams, the 2004 Walmart transaction DB, a typical oil company, the Merck bio research DB, UPMC hospitals imaging data, one day of instant messaging in 2002, the MIT Babytalk speech experiment, and the TeraShake earthquake model of the LA basin. Source: Phillip Gibbons, Intel Research Pittsburgh, 2008]


Simulations involving integration of ODEs (e.g., molecular dynamics) or PDEs (e.g., CFD, structural mechanics, weather and climate modeling) may involve modest amounts of input data, but end up generating large amounts of output: 4D data sets proportional to problem size × number of time steps.
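As a rough illustration (hypothetical numbers, not taken from the slide): a simulation on a 1000 × 1000 × 1000 grid storing five double-precision variables writes about 40 GB per snapshot, so saving 1,000 time steps produces roughly 40 TB of output.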

Many problems in domains such as graph algorithms, de novo sequence assembly, and quantum chemistry require intermediate files that are disproportionately large relative to the size of the input/output files.

See Pearce presentation on Tuesday morning and Pfeiffer presentation on Wednesday morning

Generic compute node / terminology


- Core
- Processor (Socket)
- Node (Board)

Peak flops = nodes × (processors/node) × (cores/processor) × (flops/cycle) × clock speed
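As an illustration with round, Gordon-class numbers (assumed here, not quoted from the slide): 1,024 nodes × 2 processors/node × 8 cores/processor × 8 flops/cycle × 2.6 GHz ≈ 3.4 × 10^14 flops, or roughly 340 TF peak.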

Memory intensive computing


There is no uniform definition for memory intensive computing, but here we use it to refer to problems that require more shared memory than is available on standard compute hardware.
machine            peak (TF)   nodes   mem (TB)   mem/node (GB)
Kraken             1174        9408    147        16
Ranger             579         3936    123        32
Lonestar4          302         1888    44         24
Gordon (1/1/12)    > 200       1024    64         64 (512, 1024, 2048, ... with vSMP)
Athena             166         4152    16         4
Trestles           100         324     20         64
Steele             66          893     28         16-32
Condor Pool        60          1750    27         0.5-32
Lincoln            48          192     3          16
Blacklight         37          256     32         128 (16384 shared)

Most HPC systems are designed for distributed memory applications: data structures are decomposed into chunks that are assigned to distinct compute nodes, each with its own local memory.
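A minimal sketch of that model (assuming MPI; the array size and the reduction are illustrative only):

/* Sketch of distributed-memory decomposition: a global array is split into
 * chunks, one per MPI rank, and each rank works only on its local piece.
 * Compile with an MPI wrapper such as mpicc. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n_local = 1000000;                 /* elements owned by each rank */
    double *chunk = malloc(n_local * sizeof(double));

    /* Each rank initializes and processes only its own chunk of the data */
    double local_sum = 0.0;
    for (int i = 0; i < n_local; i++) {
        chunk[i] = (double)rank * n_local + i;   /* global index, local storage */
        local_sum += chunk[i];
    }

    /* Communication is explicit: combine partial results across nodes */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %e\n", global_sum);

    free(chunk);
    MPI_Finalize();
    return 0;
}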

Why not just use a distributed memory model?


Data structures
Many important/interesting problems do not have data structures that map well to distributed memory (e.g., graphs, trees, unstructured grids)

Programmer effort
In some cases, the burden to develop distributed memory application (e.g. using MPI) is too great to justify the effort

Efficiency of implementation
Sometimes the communications overhead for a distributed memory implementation is too high and results in poor performance

OpenMP
Thread-based parallelism that employs a fork-join model. Straightforward to use and requires minimal code modification: parallelism is expressed through pragmas or directives that are ignored unless compiler flags are set. Allows for incremental parallelization of code; ideal for loop-level parallelism.

#pragma omp parallel for \
        reduction(+: sum) \
        schedule(static, 10)
for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    sum += a[i];
}

!$OMP parallel do
!$OMP& reduction(+: sum)
!$OMP& schedule(static, 10)
do i = 1, n
   a(i) = b(i) + c(i)
   sum = sum + a(i)
enddo
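With typical compilers the directives above are enabled by a flag such as gcc's -fopenmp (other compilers provide equivalent options); without the flag, the pragmas and directives are simply ignored and the loop runs serially.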

In my opinion, this is THE best place to learn more about OpenMP https://computing.llnl.gov/tutorials/openMP/

Memory intensive problem: conformational sampling

Generation of molecular conformations from low-order probability distribution functions (PDFs) makes it possible to calculate thermodynamic quantities that are not accessible from MD.

Somani, Killian, and Gilson, J Chem Phys 130 (2009) Somani and Gilson, J Chem Phys 134 (2011)

Conformational sampling data structures and memory access


- Data structures for N degrees of freedom and M bins:
  - Singlet PDFs: N blocks of size M
  - Doublet PDFs: N(N-1)/2 blocks of size M^2
  - Triplet PDFs: N(N-1)(N-2)/6 blocks of size M^3
- Typically N ~ 50-200, M = 30
- Sample rows (pencils) from the 2D (3D) arrays, with a different access pattern for each conformation
- Convenient to have the entire problem in a single shared memory
- For N = 200, M = 30, required memory ~ 130 GB (see the sketch below)
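A quick way to see where that ~130 GB comes from is to total the PDF blocks directly; the short sketch below assumes each histogram bin is a 4-byte float (an assumption, not stated on the slide):

/* Back-of-the-envelope estimate of the PDF storage described above.
 * Illustrative sketch only; assumes 4-byte floats per bin. */
#include <stdio.h>

int main(void)
{
    const double N = 200.0, M = 30.0, bytes_per_bin = 4.0;

    double singlet = N * M;                                        /* N blocks of size M   */
    double doublet = N * (N - 1.0) / 2.0 * M * M;                  /* N(N-1)/2 blocks, M^2 */
    double triplet = N * (N - 1.0) * (N - 2.0) / 6.0 * M * M * M;  /* N(N-1)(N-2)/6, M^3   */

    double gib = (singlet + doublet + triplet) * bytes_per_bin
                 / (1024.0 * 1024.0 * 1024.0);
    printf("estimated PDF memory: %.0f GB\n", gib);   /* ~132 GB, i.e. order 130 GB */
    return 0;
}

The triplet PDFs dominate the total, which is why the whole problem only fits comfortably in a large shared memory.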

[Figure: doublet sampling with N=6, M=5, showing the array elements accessed for three different conformations]

Gilson, Somani, and Sinkovits (Dash/Gordon collaboration)

Memory intensive problem: subset removal algorithm

- Transient objects detected in the night sky by the Large Synoptic Survey Telescope (LSST)
- Detections grouped into tracks that may represent partial orbits of asteroids or other near-earth objects
- As candidate tracks are constructed, some tracks are found to be wholly contained within others; a subset removal algorithm is used to detect and delete these tracks
- In addition to avoiding duplication, this reduces the computational load in later steps of the pipeline

See Myers presentation on Wed morning

Subset removal data structure

- Detections are stored in a red-black tree to minimize access time
- Tracks associated with each detection are also stored as trees, resulting in a tree-of-trees data structure (sketched below)
- The subset removal algorithm is most efficient if the entire data structure is stored in shared memory
- For realistic problems, memory footprint ~ 100 GB
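As a rough illustration of the tree-of-trees idea (hypothetical C types; this is a sketch, not the production MOPS code):

/* Sketch of a tree-of-trees layout: an outer search tree keyed by detection,
 * where each detection owns an inner tree of the tracks that contain it.
 * All type and field names here are illustrative. */
typedef struct track_node {
    long               track_id;
    struct track_node *left, *right;      /* inner tree of tracks for one detection */
} track_node;

typedef struct detection_node {
    long                   detection_id;
    track_node            *tracks;        /* root of this detection's track tree    */
    struct detection_node *left, *right;  /* outer red-black tree of detections     */
    int                    color;         /* red/black bookkeeping                  */
} detection_node;

Keeping both levels of the structure in a single shared address space lets the subset test walk from any detection to every track that references it without touching disk.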

See Myers presentation on Wed morning

Blurring the boundary between data and memory intensive computing


- /scratch does not necessarily have to be hard disk (HDD)
- To improve performance, scratch files can be written to flash drives: O(10^2) lower latency than HDD
- To do even better, files can be written to DRAM: O(10^3-10^5) lower latency than HDD (see the sketch below)
See Strande presentation on Mon afternoon and Tatineni presentations on Tuesday morning
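As a minimal sketch of the idea (SCRATCH_DIR is a hypothetical environment variable; /dev/shm is the DRAM-backed tmpfs found on typical Linux systems):

/* Sketch: direct temporary files to faster storage when available.
 * SCRATCH_DIR might point at a flash-mounted scratch directory; if it is
 * not set, fall back to DRAM-backed tmpfs. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *scratch = getenv("SCRATCH_DIR");   /* e.g. flash-mounted scratch */
    if (scratch == NULL)
        scratch = "/dev/shm";                      /* DRAM-backed tmpfs fallback */

    char path[4096];
    snprintf(path, sizeof(path), "%s/intermediate.dat", scratch);

    FILE *fp = fopen(path, "wb");                  /* intermediate results go here */
    if (fp == NULL) { perror("fopen"); return 1; }
    /* ... write intermediate data ... */
    fclose(fp);
    return 0;
}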

How Gordon can help you solve data/memory intensive problems


- Flash memory
  - 80 GB local to each compute node (1024 nodes)
  - 4.8 TB served from each I/O node (64 nodes)
- Parallel file systems
  - Lustre file system with 4 PB capacity
  - 100 GB/s aggregate bandwidth into I/O nodes
- vSMP
  - Aggregate memory: 512, 1024, 2048 GB
- Allocations and operations
  - Dedicated long-term access to I/O nodes
  - Interactive queues for visualization
- Software
- Advanced User Support Services
See Strande presentation on Monday afternoon and Wilkins-Diehr presentation on Thursday morning

For data intensive applications, the main advantage of flash is the low latency

- Performance of the memory subsystem has not kept up with gains in processor speed
- As a result, latencies to access data from hard disk are O(10,000,000) cycles
- Flash memory fills this gap and provides O(100) lower latency
- As new non-volatile memories are developed, they can fill the role of flash

Using flash in the memory hierarchy: Parallel Streamline Visualization

Camp et al., accepted to IEEE Symp. on Large-Scale Data Analysis and Visualization (LDAV 2011). See Camp presentation on Tuesday morning.

Introduction to vSMP

[Diagram: N servers, each running its own OS, aggregated by vSMP into 1 VM running 1 OS]

Virtualization software for aggregating multiple off-the-shelf systems into a single virtual machine, providing improved usability and higher performance

Partitioning: a subset of the physical resource. Multiple virtual machines, each with its own application and OS, run on a single physical system on top of a hypervisor or VMM.

Aggregation: a concatenation of physical resources. A single virtual machine, with one application and OS, runs across multiple physical systems, each running a hypervisor or VMM.

See Paikowsky presentation on Wed afternoon

A vSMP node is configured from 16 compute nodes and one I/O node.

To the user, this logically appears as a single, large SMP node.

Overview of a vSMP node

/proc/cpuinfo indicates 128 processors (16 nodes x 8 cores/node = 128)

top shows 663 GB of memory (16 nodes x 48 GB/node = 768 GB); the difference is due to vSMP overhead

Gordon Software
chemistry: adf, amber, gamess, gaussian, gromacs, lammps, namd, nwchem
distributed computing: globus, Hadoop, MapReduce
visualization: idl, NCL, paraview, tecplot, visit, VTK
genomics: abyss, blast, hmmer, soapdenovo, velvet
data mining: IntelligentMiner, RapidMiner, RATTLE, Weka
compilers/languages: gcc, intel, pgi; MATLAB, Octave, R; PGAS (UPC)
databases: DB2, PostgreSQL
libraries: ATLAS, BLACS, fftw, HDF5, Hypre, SPRNG, superLU

* Partial list of software to be installed, open to user requests

From I/O bound to compute bound: breadth-first search


[Chart: MR-BFS serial performance on a graph with 134,217,726 nodes; run time (s) split into I/O and non-I/O time, comparing SSDs and HDDs]

- Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
- Benchmark problem: BFS on a graph containing 134 million nodes
- Use of flash drives reduced I/O time by a factor of 6.5x; as expected, no measurable impact on non-I/O operations
- Problem converted from I/O bound to compute bound

Using flash to improve serial performance: LIDAR


[Chart: run time (s) for 100 GB load (with and without FastParse) and 100 GB count(*) (cold and warm), comparing SSDs and HDDs]

- LIDAR: remote sensing technology used to map geographic features with high resolution
- Benchmark problem: load 100 GB of data into a single table (DB2 database instance), then count rows
- Flash drives 1.5x (load) to 2.4x (count) faster than hard disks


Using flash to improve concurrency: LIDAR


[Chart: run time (s) for 1, 4, and 8 concurrent LIDAR queries, comparing SSDs and HDDs]

- LIDAR: remote sensing technology used to map geographic features with high resolution
- Comparison of runtimes for concurrent LIDAR queries obtained with flash drives (SSD) and hard drives (HDD), using the Alaska Denali-Totschunda data collection
- The impact of SSDs was modest, but significant when executing multiple simultaneous queries


vSMP case study: MOPS (subset removal)


[Chart: relative speed vs. number of cores (1-32) for MOPS subset removal on 79,684,646 tracks, comparing vSMP versions 3.5.175.22 (dyn) and 3.5.175.17 (dyn and stat) against the PDAF large-memory node. Results at higher thread counts to be shown Wednesday.]

Total memory usage ~ 100 GB (3 boards).
See Myers presentation on Wed morning

- Sets of detections collected using the Large Synoptic Survey Telescope are grouped into tracks representing potential asteroid orbits
- Subset removal algorithm used to identify and eliminate those tracks that are wholly contained within other tracks
- 7.3x speedup on 8 cores
- Better performance and scaling (≥ 8 threads) than the physical large-memory PDAF node

Summary
- Gordon can be used to solve data and memory intensive problems that cannot be handled by even the largest HPC systems
- Already having great success stories with Dash, but things will only get better with improved processors, interconnect, flash drives, and vSMP software
- SDSC provides much more than cycles: our expertise can help you make the most of Gordon and enable transitions from desktop to supercomputing
