I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Distributed GLM with H2O - Atlanta Meetup (Sri Ambati)
The document outlines a presentation about H2O's distributed generalized linear model (GLM) algorithm. The presentation includes sections on H2O.ai the company, an overview of the H2O software, a 30-minute section explaining H2O's distributed GLM in detail, a 15-minute GLM demo, and a question-and-answer period. The document provides background on H2O.ai and H2O, and outlines the topics covered in the distributed GLM section, including the algorithm, input parameters, outputs, runtime costs, and best practices.
A survey paper on sequence pattern mining with incremental (Alexander Decker)
This document summarizes four algorithms for sequential pattern mining: GSP, ISM, FreeSpan, and PrefixSpan. GSP is an Apriori-based algorithm that takes into account time constraints and taxonomies. ISM extends SPADE to incrementally update the frequent pattern set when new data is added. FreeSpan uses frequent items to recursively project databases and grow subsequences. PrefixSpan also uses projection but claims to not require candidate generation. It recursively projects databases based on short prefix patterns. The document concludes that most previous studies used GSP or PrefixSpan and that future work could focus on improving time efficiency of sequential pattern mining.
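To make the prefix-projection idea concrete, here is a minimal sketch of PrefixSpan-style mining, assuming sequences of single items rather than the itemset elements handled by the full algorithm; the toy database and support threshold are invented for illustration.

```python
from collections import defaultdict

def prefixspan(sequences, min_support, prefix=None):
    """Minimal PrefixSpan-style sketch: grow frequent prefixes by projecting
    the database on each frequent item (single-item elements only)."""
    prefix = prefix or []
    results = []
    # Count each item once per sequence it appears in.
    counts = defaultdict(int)
    for seq in sequences:
        for item in set(seq):
            counts[item] += 1
    for item, count in counts.items():
        if count < min_support:
            continue
        new_prefix = prefix + [item]
        results.append((new_prefix, count))
        # Project: keep the suffix after the first occurrence of `item`.
        projected = []
        for seq in sequences:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        results.extend(prefixspan(projected, min_support, new_prefix))
    return results

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "c", "b", "c"], ["b", "c"], ["a", "b", "c", "c"]]
    for pattern, support in prefixspan(db, min_support=3):
        print(pattern, support)
```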
Probabilistic Data Structures and Approximate Solutions (Oleksandr Pryymak)
Probabilistic and approximate data structures can provide scalable solutions when exact answers are not required. They trade accuracy for speed and efficiency. Approaches like sampling, hashing, cardinality estimation, and probabilistic databases allow analyzing large datasets while controlling error rates. Example techniques discussed include Bloom filters, locality-sensitive hashing, count-min sketches, HyperLogLog, and feature hashing for machine learning. The talk provided code examples and comparisons of these probabilistic methods.
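As a concrete illustration of one of these structures, below is a minimal Bloom filter sketch in Python; the bit-array size, number of hash functions, and hashing scheme are arbitrary choices for demonstration, not the parameters used in the talk.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions over an m-bit array.
    May return false positives, never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob", "carol"]:
    bf.add(user)
print("bob" in bf, "mallory" in bf)   # True, almost certainly False
```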
Every year the financial industry loses billions to fraud, while fraudsters keep devising ever more sophisticated schemes.
Financial institutions have to balance fraud protection against a degraded customer experience. Fraudsters bury their patterns in large volumes of data, but traditional technologies are not designed to detect fraud in real time or to see patterns that extend beyond an individual account.
Analyzing relations with graph databases helps uncover these larger complex patterns and speeds up suspicious behavior identification.
Furthermore, graph databases enable fast and effective real-time link queries and passing context to machine learning models.
The earlier a fraud pattern or network is identified, the faster the activity can be blocked, and the smaller the resulting losses and fines.
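To illustrate the kind of link query involved, here is a toy sketch using networkx rather than a production graph database; the account and device identifiers are invented, and a real deployment would run such traversals inside the database itself in real time.

```python
import networkx as nx

# Toy identity/transaction graph: nodes are accounts and shared attributes
# (devices, addresses); an edge means "linked to".
G = nx.Graph()
G.add_edges_from([
    ("acct:A", "device:123"), ("acct:B", "device:123"),
    ("acct:B", "addr:9 Elm St"), ("acct:C", "addr:9 Elm St"),
    ("acct:D", "device:999"),
])

flagged = "acct:A"  # an account already confirmed as fraudulent

# Link query: which other accounts sit within a few hops of the flagged
# account, i.e. share a device or address directly or via another account?
nearby = dict(nx.single_source_shortest_path_length(G, flagged, cutoff=4))
suspects = [n for n, dist in nearby.items()
            if n.startswith("acct:") and n != flagged]
print(suspects)   # expect acct:B and acct:C as candidates for review
```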
Virtual Knowledge Graphs for Federated Log Analysis (Kabul Kurniawan)
This document presents a method for executing federated graph pattern queries on dispersed and heterogeneous raw log data by dynamically constructing virtual knowledge graphs (VKGs). The approach extracts only relevant log messages on demand, integrates log events into a common graph, federates queries across endpoints, and links results to background knowledge. The architecture includes modules for log parsing and query processing, and a prototype implementation demonstrates the approach for security analytics use cases. An evaluation analyzes query execution time against factors such as the number of extracted log lines and the number of queried hosts.
In this paper, we propose the problem of implementing an efficient query processing system for incomplete temporal and geospatial information in RDFi as a challenge to the SSTD community.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
Spark DataFrames provide a more optimized way to work with structured data compared to RDDs. DataFrames allow skipping unnecessary data partitions when querying, such as only reading data partitions that match certain criteria like date ranges. DataFrames also integrate better with storage formats like Parquet, which stores data in a columnar format and allows skipping unrelated columns during queries to improve performance. The code examples demonstrate loading a CSV file into a DataFrame, finding and removing duplicate records, and counting duplicate records by key to identify potential duplicates.
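A rough PySpark sketch of the workflow described above; the file path and column names are hypothetical, not taken from the original slides.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

# Load a CSV into a DataFrame (path and schema are hypothetical).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Count duplicate records by key to see how widespread the problem is.
dupe_counts = (df.groupBy("user_id", "event_time")
                 .count()
                 .filter(F.col("count") > 1))
dupe_counts.show()

# Drop duplicates, keeping one row per key.
deduped = df.dropDuplicates(["user_id", "event_time"])

# Partition pruning example: a date filter lets Spark skip partitions
# (and, with Parquet, unrelated columns) instead of scanning everything.
deduped.filter(F.col("event_date") >= "2024-01-01") \
       .write.mode("overwrite").parquet("events_clean.parquet")
```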
Interactive Knowledge Discovery over Web of Data (Mehwish Alam)
This document describes research on classifying and exploring data from the Web of Data. It discusses building a classification structure over RDF data by classifying triples based on RDF Schema and creating views through SPARQL queries. This structure can then be used for data completion and interactive knowledge discovery through data analysis and visualization. Formal concept analysis and pattern structures are introduced as techniques for dealing with complex data types from the Web of Data like graphs and linked data. Range minimum queries are also proposed as a way to compute the lowest common ancestor for structured attribute sets in the pattern structures.
A talk presented at an NSF Workshop on Data-Intensive Computing, July 30, 2009.
Extreme scripting and other adventures in data-intensive computing
Data analysis in many scientific laboratories is performed via a mix of standalone analysis programs, often written in languages such as Matlab or R, and shell scripts, used to coordinate multiple invocations of these programs. These programs and scripts all run against a shared file system that is used to store both experimental data and computational results.
While superficially messy, the flexibility and simplicity of this approach makes it highly popular and surprisingly effective. However, continued exponential growth in data volumes is leading to a crisis of sorts in many laboratories. Workstations and file servers, even local clusters and storage arrays, are no longer adequate. Users also struggle with the logistical challenges of managing growing numbers of files and computational tasks. In other words, they face the need to engage in data-intensive computing.
We describe the Swift project, an approach to this problem that seeks not to replace the scripting approach but to scale it, from the desktop to larger clusters and ultimately to supercomputers. Motivated by applications in the physical, biological, and social sciences, we have developed methods that allow for the specification of parallel scripts that operate on large amounts of data, and the efficient and reliable execution of those scripts on different computing systems. A particular focus of this work is on methods for implementing, in an efficient and scalable manner, the Posix file system semantics that underpin scripting applications. These methods have allowed us to run applications unchanged on workstations, clusters, infrastructure as a service ("cloud") systems, and supercomputers, and to scale applications from a single workstation to a 160,000-core supercomputer.
Swift is one of a variety of projects in the Computation Institute that seek individually and collectively to develop and apply software architectures and methods for data-intensive computing. Our investigations seek to treat data management and analysis as an end-to-end problem. Because interesting data often has its origins in multiple organizations, a full treatment must encompass not only data analysis but also issues of data discovery, access, and integration. Depending on context, data-intensive applications may have to compute on data at its source, move data to computing, operate on streaming data, or adopt some hybrid of these and other approaches.
Thus, our projects span a wide range, from software technologies (e.g., Swift, the Nimbus infrastructure as a service system, the GridFTP and DataKoa data movement and management systems, the Globus tools for service oriented science, the PVFS parallel file system) to application-oriented projects (e.g., text analysis in the biological sciences, metagenomic analysis, image analysis in neuroscience, information integration for health care applications, management of experimental data from X-ray sources, diffusion tensor imaging for computer aided diagnosis), and the creation and operation of national-scale infrastructures, including the Earth System Grid (ESG), cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), TeraGrid, and Open Science Grid (OSG).
For more information, please see www.ci.uchicago/swift.
The document discusses using probabilistic data structures like HyperLogLog and Bloom filters to estimate the number of unique users or elements in massive streaming data in real time. HyperLogLog works by tracking the maximum number of leading zeros observed in hashed values to estimate the number of unique elements. While it provides only an approximate count, it is very space- and memory-efficient. The document provides an example pipeline for processing ad-viewing data and counting unique users in subgroups using both HyperLogLog and Bloom filters.
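A minimal HyperLogLog-style estimator, for intuition only: it omits the small- and large-range corrections of the full algorithm, and the register count is an arbitrary choice.

```python
import hashlib

def hll_estimate(items, p=12):
    """Minimal HyperLogLog sketch (no small/large-range corrections).
    The first p hash bits choose one of m = 2**p registers; each register
    keeps the maximum number of leading zeros (+1) seen in the rest."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        x = int(hashlib.sha256(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - p)                      # first p bits pick a register
        rest = x & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # standard bias constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# Uses a tiny fraction of the memory of an exact set, and typically lands
# within a few percent of the true count.
print(hll_estimate(f"user-{i % 50000}" for i in range(200000)))  # ≈ 50000
```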
Benchmark MinHash+LSH algorithm on Spark (Xiaoqian Liu)
This document summarizes benchmarking the MinHash and Locality Sensitive Hashing (LSH) algorithm for calculating pairwise similarity on Reddit post data in Spark. The MinHash algorithm was used to reduce the dimensionality of the data before applying LSH to further reduce dimensionality and find similar items. Benchmarking showed that MinHash+LSH was significantly faster than a brute force approach, calculating similarities in 7.68 seconds for 100k entries compared to 9.99 billion seconds for brute force. Precision was lower for MinHash+LSH at 0.009 compared to 1 for brute force, but recall was higher at 0.036 compared to vanishingly small for brute force. The techniques were also applied to a real-time streaming
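For intuition, here is a plain-Python sketch of MinHash signatures plus LSH banding; it is not the benchmarked Spark implementation, and the posts, hash counts, and band sizes are illustrative. Spark ML also ships a MinHashLSH estimator with an approxSimilarityJoin method for doing the same at cluster scale.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=64):
    """MinHash: for each seeded hash function, keep the minimum hash over
    the token set; matching signature positions estimate Jaccard similarity."""
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def lsh_candidate_pairs(signatures, bands=16, rows=4):
    """LSH banding: items whose signatures agree on all rows of at least
    one band become candidate pairs, avoiding the all-pairs comparison."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

posts = {
    "p1": "the cat sat on the mat".split(),
    "p2": "the cat sat on a mat".split(),
    "p3": "completely different reddit post about spark".split(),
}
sigs = {pid: minhash_signature(set(toks)) for pid, toks in posts.items()}
print(lsh_candidate_pairs(sigs))   # expect ('p1', 'p2') as a candidate pair
```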
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z... (CIARD Movement)
Presentation delivered at the Agricultural Data Interoperability Interest Group -- Research Data Alliance (RDA) 4th Plenary Meeting -- Amsterdam, September 2014
This document discusses GraphQL and Dgraph with Go. It begins by introducing GraphQL and some popular GraphQL implementations in Go, such as graphql-go. It then discusses Dgraph, describing it as a distributed, high-performance graph database written in Go. It provides examples of using the Dgraph Go client to perform CRUD operations, query for single and multiple objects, commit transactions, and more.
Implementing a VO archive for datacubes of galaxies (Jose Enrique Ruiz)
The document describes implementing a VO archive for galaxy datacubes. It details collections of FITS files containing 2D spatial and spectral data on galaxies from two telescopes. A MySQL database stores metadata on the datasets extracted from FITS headers using IPython notebooks. The web interface allows discovering, viewing metadata, and accessing the data through use cases like moment maps and channel maps. The archive aims to provide characterization of emission lines and provenance to better understand the radio interferometric data.
Probabilistic algorithms for fun and pseudorandom profit (Tyler Treat)
There's an increasing demand for real-time data ingestion and processing. Systems like Apache Kafka, Samza, and Storm have become popular for this reason. This type of high-volume, online data processing presents an interesting set of new challenges, namely, how do we drink from the firehose without getting drenched? This talk explores some of the fundamental primitives used in stream processing and, specifically, how probabilistic methods can help solve the problem.
This document discusses machine learning techniques for recommendations and clustering. It introduces recommendation algorithms that analyze user-item interaction data to find items that tend to be interacted with by the same users (co-occurrence). It also discusses techniques for fast, scalable clustering of large datasets, including using a surrogate method to cluster data quickly before applying a higher-quality algorithm to the resulting centroids. The document emphasizes that simple techniques like logging, counting, and session analysis often work best at large scale, and it provides examples of using recommendations for queries, videos, and music.
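A toy sketch of the logging-and-counting style of recommendation described above; the interaction log is invented, and a real system would do the counting in a distributed job rather than in memory.

```python
from collections import defaultdict
from itertools import combinations

# Toy interaction log: (user, item) pairs, the output of simple logging.
events = [("u1", "videoA"), ("u1", "videoB"), ("u2", "videoA"),
          ("u2", "videoB"), ("u2", "videoC"), ("u3", "videoB"),
          ("u3", "videoC")]

items_by_user = defaultdict(set)
for user, item in events:
    items_by_user[user].add(item)

# Count how often two items are touched by the same user.
cooccur = defaultdict(int)
for items in items_by_user.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(item, k=2):
    """Recommend the items most frequently co-interacted with `item`."""
    scored = [(other, n) for (a, other), n in cooccur.items() if a == item]
    return sorted(scored, key=lambda x: -x[1])[:k]

print(recommend("videoA"))   # videoB co-occurs twice, videoC once
```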
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServer (Vitor Hirota Makiyama)
This document outlines a master's dissertation proposal to apply text mining techniques to previously submitted SQL queries on the Sloan Digital Sky Survey (SDSS) SkyServer database in order to improve the user experience. The proposal discusses using text mining and information retrieval methods like clustering and locality sensitive hashing to group similar past queries and recommend them to users based on new queries submitted, with the goal of helping users better explore and understand the complex database. An overview of relevant text mining, information retrieval, and machine learning concepts is also provided.
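A small sketch of the general idea, grouping query text with TF-IDF and k-means via scikit-learn; the queries are made up in a SkyServer style, and the dissertation's actual feature extraction and algorithms may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A handful of invented SkyServer-style queries; the real corpus is the
# SkyServer query log studied in the proposal.
queries = [
    "SELECT ra, dec FROM PhotoObj WHERE clean = 1",
    "SELECT ra, dec, u, g, r FROM PhotoObj WHERE r < 19",
    "SELECT z FROM SpecObj WHERE class = 'QSO'",
    "SELECT z, zErr FROM SpecObj WHERE zWarning = 0",
]

# Treat each SQL string as a document; the token pattern keeps identifiers.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+", lowercase=True)
X = vectorizer.fit_transform(queries)

# Group similar queries; cluster labels could drive "queries like yours".
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for q, label in zip(queries, km.labels_):
    print(label, q)
```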
Astronomical Data Processing on the LSST Scale with Apache Spark (Databricks)
The next decade promises to be exciting for both astronomy and computer science, with a number of large-scale astronomical surveys in preparation. One of the most important is the Large Synoptic Survey Telescope, or LSST. LSST will produce the first ‘video’ of the deep sky in history by continually scanning the visible sky and taking one 3.2-gigapixel image every 20 seconds. In this talk we will describe LSST’s unique design and how its image processing pipeline produces catalogs of astronomical objects. To process and quickly cross-match catalog data we built AXS (Astronomy Extensions for Spark), a system based on Apache Spark. We will explain its design and what is behind its great cross-matching performance.
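To give a flavor of catalog cross-matching (this is not AXS itself), here is a simplified PySpark sketch that buckets objects into declination zones before joining; it ignores zone boundaries, RA wrap-around, and the cos(dec) correction a real matcher must handle, and the file names and columns are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("crossmatch-sketch").getOrCreate()

# Two hypothetical catalogs with (id, ra, dec) columns in degrees.
cat1 = spark.read.parquet("catalog_a.parquet")
cat2 = spark.read.parquet("catalog_b.parquet")

# Zone the sky into declination strips so the join only compares objects
# in the same strip, not all pairs.
zone_height = 1.0 / 60.0   # one-arcminute zones
z1 = cat1.withColumn("zone", F.floor((F.col("dec") + 90.0) / zone_height))
z2 = cat2.withColumn("zone", F.floor((F.col("dec") + 90.0) / zone_height))

radius = 1.0 / 3600.0      # 1-arcsecond match radius, in degrees
matches = (z1.alias("a")
             .join(z2.alias("b"), F.col("a.zone") == F.col("b.zone"))
             .where((F.abs(F.col("a.dec") - F.col("b.dec")) < radius) &
                    (F.abs(F.col("a.ra") - F.col("b.ra")) < radius)))
matches.select("a.id", "b.id").show()
```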
This document discusses using R for statistical analysis with MongoDB as the database. It introduces MongoDB as a NoSQL database for storing large, complex datasets. It describes the rmongodb package for connecting R to MongoDB, allowing users to query, aggregate, and analyze MongoDB data directly in R without importing entire datasets into memory. Examples show performing queries, aggregations, and accessing results as native R objects. The document promotes R and MongoDB as a solution for big data analytics.
Carl Kesselman and I (along with our colleagues Stephan Erberich, Jonathan Silverstein, and Steve Tuecke) participated in an interesting workshop at the Institute of Medicine on July 14, 2009. Along with Patrick Soon-Shiong, we presented our views on how grid technologies can help address the challenges inherent in healthcare data integration.
Rethinking how we provide science IT in an era of massive data but modest bud... (Ian Foster)
A talk given in January 2012 at a wonderful conference organized in Zakopane, Poland, by colleagues from the erstwhile GridLab project. I talked about how increasing data volumes demand radically new approaches to delivering research computing. Lively discussion ensued.
Computing Outside The Box September 2009 (Ian Foster)
Keynote talk at Parco 2009 in Lyon, France. An updated version of http://www.slideshare.net/ianfoster/computing-outside-the-box-june-2009.
I gave this talk at a conference for young scientists in New Zealand, "Running Hot": www.runninghot.org.nz. It was a great meeting. My slides are mostly images, so they may not make much sense on their own.
Abstract follows: Impressed with the telephone, Arthur Mee predicted in 1898 that if videoconferencing could be developed, ‘earth will be in truth a paradise.’ Since his time, rapid technological change, in particular in telecommunications, has transformed the scientific playing field in ways that while not entirely paradisical, certainly have profound implications for New Zealand scientists. The Internet has abolished distance, as Mee also predicted–a New Zealand scientist can participate as fully in online discussions as anyone else, and their blog can be every bit as influential. Exponential improvements in networks, computing, sensors, and data storage are also profoundly transforming the practice of science in many disciplines. But those seeking to leverage these advances become painfully familiar with the ‘dirty underbelly’ of exponentials: if you don’t constantly innovate, you can fall behind exponentially fast. Such considerations pose big challenges for the individual scientist and for institutions, for researchers and educators, and for research funders. Some of the old ways of researching and educating need to be preserved, others need to be replaced to take advantage of new methods. But what should we preserve? What should we seek to change?
The document provides tips for job interviews, including researching the company beforehand, practicing interview skills, greeting the interviewer with a handshake and making eye contact, listening carefully and asking questions if confused, answering questions directly and positively, bringing references, and preparing for common questions about work history, strengths, weaknesses, and why the employer should hire you. It also describes what to expect at assessment centers, including role plays, group exercises, interviews, and testing. Finally, it discusses the pros and cons of internal versus external recruitment.
Recruiting in a Networked World - Workshop Series (hholmes75)
This document summarizes a presentation on employer branding and recruiting in a networked world. The presentation outlines how to be an employer of choice by turning jobs into opportunities and showing your company's human side on its career page. It discusses how employer branding attracts great employees and helps retain top talent. The presentation provides tips on improving job descriptions, developing an inspiring career page, and using social media to build your employer brand.
This document discusses using cloud services to facilitate materials data sharing and analysis. It proposes a "Discovery Cloud" that would allow researchers to easily store, curate, discover, and analyze materials data without needing local software or hardware. This cloud platform could accelerate discovery by automating workflows and reducing costs through on-demand scalability. It would also make long-term data preservation simpler. The document highlights Globus research data management services as an example of cloud tools that could help address the dual challenges of treating data as both a rare treasure to preserve and a "deluge" to efficiently manage.
The document provides tips for job interviews, including researching the company beforehand, practicing interview skills, greeting the interviewer with a handshake and making eye contact, listening carefully and asking questions if confused. It also discusses common interview questions and positive ways to answer them. The document then describes what happens at assessment centers, including role plays, group exercises, interviews and tests. Finally, it discusses the pros and cons of internal versus external recruitment for filling jobs.
The document discusses how emerging technologies are enabling new approaches to modeling complex systems using large numbers of autonomous agents. It describes efforts to develop agent-based modeling frameworks that can leverage exascale supercomputers to simulate phenomena like microbial ecosystems, cybersecurity, and energy systems at an unprecedented scale. These models incorporate hybrid discrete-continuous methods and very high-resolution data to better understand dynamic social and natural processes.
The document discusses how computation can accelerate the generation of new knowledge by enabling large-scale collaborative research and extracting insights from vast amounts of data. It provides examples from astronomy, physics simulations, and biomedical research where computation has allowed more data and researchers to be incorporated, advancing various fields more quickly over time. Computation allows for data sharing, analysis, and hypothesis generation at scales not previously possible.
Globus Online provides services to enable easy and reliable data transfer between campus resources and national cyberinfrastructure. It uses Globus Transfer for simple file transfers and Globus Connect to easily integrate campus resources. Globus Connect Multi-User allows administrators to easily deploy GridFTP servers and authentication for multiple users, facilitating campus bridging. Several universities have found success using these Globus services to enable terabyte-scale data sharing across their campuses and with national resources.
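As an illustration of what such a managed transfer looks like programmatically, here is a sketch using the Globus Python SDK; the client ID, endpoint IDs, and paths are placeholders invented for the example.

```python
import globus_sdk

# Hypothetical IDs: a registered native-app client and two endpoints
# (e.g., a campus DTN and a national facility) found in the Globus web app.
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC = "SOURCE-ENDPOINT-UUID"
DST = "DESTINATION-ENDPOINT-UUID"

# Log in and obtain a transfer token via the native-app flow.
client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow()
print("Visit:", client.oauth2_get_authorize_url())
tokens = client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_tokens["access_token"]))

# Describe and submit a managed transfer; the service handles retries and
# integrity checking, which is the "easy and reliable" part of the pitch.
tdata = globus_sdk.TransferData(tc, SRC, DST, label="campus-bridging demo")
tdata.add_item("/data/run42/", "/archive/run42/", recursive=True)
result = tc.submit_transfer(tdata)
print("Task ID:", result["task_id"])
```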
Screenshots prepared by Ben Blaiszik and Kyle Chard, used in our Globus publication demo at GlobusWorld 2014. See https://www.globus.org/data-publication for more information and the notes on the slides for details.
1) Quantitative medicine uses large amounts of medical data and advanced analytics to determine the most effective treatment for individual patients based on their specific clinical profile and biomarkers. This approach can help reduce healthcare costs and improve outcomes compared to the traditional one-size-fits-all model.
2) However, realizing the promise of quantitative personalized medicine is challenging due to the huge quantities of diverse medical data located in dispersed systems, lack of computing capabilities, and barriers to data sharing.
3) Grid and service-oriented computing approaches are helping to address these challenges by enabling federated querying, analysis, and sharing of medical data and services across organizations through virtual integration rather than true consolidation.
The "Recruiting in a Networked World" workshops will help you understand and capitalize on this sophisticated new environment. Focusing on hot-button topics such as Employer Branding and Social Media including "Flitterin" (Facebook, Twitter, LinkedIn), our workshops dispel myths, offer insight, and explain why HR needs to talk like PR and think like marketing.
The document discusses the rapidly growing volumes of data being generated across many scientific domains such as biology, astronomy, climate science, and others. It notes that while "big science" projects have been able to develop robust cyberinfrastructure to manage and analyze large datasets, most individual researchers and smaller research groups lack adequate computing resources and software tools to effectively handle the data. The author argues that providing research cyberinfrastructure as a cloud-based service could help address this problem by reducing costs and barriers to entry for researchers. Specific services like Globus Online for data transfer and potential future services for storage, collaboration, and integration with other tools are presented as examples of this approach.
Scientific Applications and Heterogeneous Architectures (inside-BigData.com)
This document discusses extending high-performance computing (HPC) to integrate data analytics and connect to edge computing. It presents two use cases: 1) augmenting molecular dynamics workflows with in situ and in transit analytics to capture protein structural information, and 2) connecting HPC to sensors at the edge for precision farming applications involving soil moisture data prediction. The document outlines approaches for building closed-loop workflows that integrate simulation, data generation, analytics, and data feedback between HPC and edge resources to enable real-time decision making.
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science (University of Washington)
The document summarizes a system called SQLShare that aims to make SQL-based data analysis more accessible to scientists by lowering initial setup costs and providing automated tools. It has been used by 50 unique users at 4 UW campus labs on 16GB of uploaded data from various science domains like environmental science and metagenomics. The system provides data uploading, query sharing, automatic English-to-SQL translation, and personalized query recommendations to lower barriers to working with relational databases for analysis.
Opportunities for X-Ray science in future computing architectures (Ian Foster)
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no signs of slowing down: the next 10 years will surely see exascale, new cloud offerings, and terabit networks. In this talk I review several of these developments and discuss their potential implications for X-ray science and X-ray facilities.
HEPData is a repository for data from high energy physics (HEP) experiments dating back to the 1950s. It provides a standardized way for scientists to submit the underlying data from their published papers and analysis results. This includes tables, plots, scripts and files used in the analysis to enable reproducibility. HEPData offers features like simplified submission processes, versioning, DOIs, and tools to access and search data in various environments and formats to help both data providers and consumers.
This document discusses using cloud computing and virtualization for scientific research. Some key points:
- Scientists can access remote sensors, share data and workflows, and store personal data in the cloud. Beginners can click to code, while experts can build complex workflows.
- Services allow publishing, finding, and binding to distributed resources through registries. Data can be queried through standards like Simple Image Access Protocol.
- Distributed registries from various organizations harvest metadata to enable semantic search across sky regions, identifiers, tags, vocabularies, schemas, and service descriptions.
- Tools provide code/presentation environments and access to distributed data in the cloud. Services include astronomical cross-matching and event notification through Sky
This document describes Emanuele Panigati's doctoral dissertation on the SuNDroPS system for managing semantic and dynamic data in pervasive systems. It provides an overview of SuNDroPS and its components for processing streaming and historical data, including Context-ADDICT for querying heterogeneous data sources and PerLa and Tesla for information flow processing. It also describes how SuNDroPS was tested in the motivating Green Move vehicle sharing scenario.
Cities are composed of complex systems with physical, cyber, and social components. Current works on extracting and understanding city events mainly rely on technology enabled infrastructure to observe and record events. In this work, we propose an approach to leverage citizen observations of various city systems and services such as traffic, public transport, water supply, weather, sewage, and public safety as a source of city events. We investigate the feasibility of using such textual streams for extracting city events from annotated text. We formalize the problem of annotating social streams such as microblogs as a sequence labeling problem. We present a novel training data creation process for training sequence labeling models. Our automatic training data creation process utilizes instance level domain knowledge (e.g., locations in a city, possible event terms). We compare this automated annotation process to a state-of-the-art tool that needs manually created training data and show that it has comparable performance in annotation tasks. An aggregation algorithm is then presented for event extraction from annotated text. We carry out a comprehensive evaluation of the event annotation and event extraction on a real-world dataset consisting of event reports and tweets collected over four months from San Francisco Bay Area. The evaluation results are promising and provide insights into the utility of social stream for extracting city events.
Provenance for Data Munging Environments (Paul Groth)
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Propagation of Policies in Rich Data Flows (Enrico Daga)
Enrico Daga† Mathieu d’Aquin† Aldo Gangemi‡ Enrico Motta†
† Knowledge Media Institute, The Open University (UK)
‡ Université Paris 13 (France) and ISTC-CNR (Italy)
The 8th International Conference on Knowledge Capture (K-CAP 2015)
October 10th, 2015 - Palisades, NY (USA)
http://www.k-cap2015.org/
The document discusses several US grid projects including campus and regional grids like Purdue and UCLA that provide tens of thousands of CPUs and petabytes of storage. It describes national grids like TeraGrid and Open Science Grid that provide over a petaflop of computing power through resource sharing agreements. It outlines specific communities and projects using these grids for sciences like high energy physics, astronomy, biosciences, and earthquake modeling through the Southern California Earthquake Center. Software providers and toolkits that enable these grids are also mentioned like Globus, Virtual Data Toolkit, and services like Introduce.
Potter's Wheel is an interactive tool for data transformation, cleaning and analysis. It integrates data auditing, transformation and analysis. The user can specify transformations by example through a spreadsheet interface. It detects discrepancies and flags them for the user. Transformations can be stored as programs to apply to data. It allows interactive exploration of data without waiting through partitioning and aggregation.
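The same audit-while-you-transform idea can be sketched in pandas; this is only an analogy, not Potter's Wheel itself, and the data and rules are invented for illustration.

```python
import pandas as pd

# Hypothetical messy input, analogous to what Potter's Wheel would audit.
df = pd.DataFrame({"name": ["Alice Smith", "Bob", "Carol Jones"],
                   "date": ["2024-01-03", "03/01/2024", "2023-02-29"]})

# A "transform by example": split name into first/last where possible.
parts = df["name"].str.split(" ", n=1, expand=True)
df["first"], df["last"] = parts[0], parts[1]

# Discrepancy detection: flag rows whose date fails to parse under the
# dominant format, instead of silently dropping or coercing them.
parsed = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
df["date_ok"] = parsed.notna()
print(df)
```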
Cyberinfrastructure and Applications Overview: Howard University June22 (marpierc)
1) Cyberinfrastructure refers to the combination of computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people that enable knowledge discovery through integrated multi-scale simulations and analyses.
2) Cloud computing, multicore processors, and Web 2.0 tools are changing the landscape of cyberinfrastructure by providing new approaches to distributed computing and data sharing that emphasize usability, collaboration, and accessibility.
3) Scientific applications are increasingly data-intensive, requiring high-performance computing resources to analyze large datasets from sources like gene sequencers, telescopes, sensors, and web crawlers.
The ultimate goal of a recommender system is to suggest interesting and not obvious items (e.g., products to buy, people to connect with, movies to watch, etc.) to users, based on their preferences.
The advent of the Linked Open Data (LOD) initiative in the Semantic Web gave birth to a variety of open knowledge bases freely accessible on the Web. They provide a valuable source of information that can improve conventional recommender systems, if properly exploited.
Here I present several approaches to recommender systems that leverage Linked Data knowledge bases such as DBpedia. In particular, content-based and hybrid recommendation algorithms will be discussed.
For full details about the presented approaches please refer to the full papers mentioned in this presentation.
The data streaming processing paradigm and its use in modern fog architectures (Vincenzo Gulisano)
Invited lecture at the University of Trieste.
The lecture covers (briefly) the data streaming processing paradigm, research challenges related to distributed, parallel and deterministic streaming analysis and the research of the DCS (Distributed Computing and Systems) groups at Chalmers University of Technology.
This document describes Jean-Paul Calbimonte's doctoral research on enabling semantic integration of streaming data sources. The research aims to provide semantic query interfaces for streaming data, expose streaming data for the semantic web, and integrate streaming sources through ontology mappings. The approach involves ontology-based data access to streams, a semantic streaming query language, and semantic integration of distributed streams. Work done so far includes defining a language (SPARQLSTR) for querying RDF streams and enabling an engine to support streaming data sources through ontology mappings. Future work involves query optimization and quantitative evaluation.
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open... (Thomas Gottron)
The intensive growth of the Linked Open Data (LOD) Cloud has spawned a web of data in which a multitude of data sources provides huge amounts of valuable information across different domains. Nowadays, when accessing and using Linked Data, the challenging question is more and more often not whether relevant data is available, but rather where it can be found, how it is structured, and how to make best use of it.
In this lecture I will start by giving a brief introduction to the concepts underlying LOD. Then I will focus on three aspects of current research:
(1) Managing Linked Data. Index structures play an important role for making use of the information in LOD cloud. I will give an overview of indexing approaches, present algorithms and discuss the ideas behind the index structures.
(2) Analysing Linked Data. I will present methods for analysing various aspects of LOD, ranging from an information-theoretic analysis for measuring structural redundancy, through formal concept analysis for identifying alternative declarative descriptions, to a dynamics analysis for capturing the evolution of Linked Data sources.
(3) Making Use of Linked Data. Finally I will give a brief overview and outlook on where the presented techniques and approaches are of practical relevance in applications.
(Talk at the IRSS summer school 2014 in Athens)
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa... (Data Con LA)
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. This talk looks at Twitter's operating experience with Heron at scale, the challenges of running it, and the approaches taken to solve those challenges.
This document discusses stream reasoning, which involves making sense of gigantic, noisy data streams in real-time to support decision making. It provides background on data streams and stream processing, introduces the concept of stream reasoning, and summarizes achievements in defining continuous query languages and efficient reasoning on streams. Open challenges remain in fully combining streams with background knowledge and distributed, parallel processing.
HEPData is a repository for data from high energy physics (HEP) experiments dating back to the 1950s. It provides physicists with access to the underlying data and tables from published papers. The new HEPData system offers simplified submission processes, standard data formats, versioning, and assigning DOIs to help data providers share their work. It also improves access and search capabilities for data consumers through features like publication-driven and data-driven searching, semantic publishing, data conversion tools, and access through analysis environments like ROOT and Mathematica.
Global Services for Global Science March 2023.pptx (Ian Foster)
We are on the verge of a global communications revolution based on ubiquitous high-speed 5G, 6G, and free-space optics technologies. The resulting global communications fabric can enable new ultra-collaborative research modalities that pool sensors, data, and computation with unprecedented flexibility and focus. But realizing these modalities requires new services to overcome the tremendous friction currently associated with any actions that traverse institutional boundaries. The solution, I argue, is new global science services to mediate between user intent and infrastructure realities. I describe our experiences building and operating such services and the principles that we have identified as needed for successful deployment and operations.
The Earth System Grid Federation: Origins, Current State, Evolution (Ian Foster)
The Earth System Grid Federation (ESGF) is a distributed network of climate data servers that archives and shares model output data used by scientists worldwide. ESGF has led data archiving for the Coupled Model Intercomparison Project (CMIP) since its inception. The ESGF Holdings have grown significantly from CMIP5 to CMIP6 and are expected to continue growing rapidly. A new ESGF2 project funded by the US Department of Energy aims to modernize ESGF to handle exabyte scale data volumes through a new architecture based on centralized Globus services, improved data discovery tools, and data proximate computing capabilities.
Better Information Faster: Programming the Continuum (Ian Foster)
This document discusses the computing continuum and efforts to enable better information faster through computation. It provides examples of how techniques like executing tasks closer to data sources or on specialized hardware can significantly accelerate applications. Programming models and managed services are explored for specifying and executing workloads across diverse infrastructure. There are still open questions around optimizing networks, algorithms, and applications for the computing continuum.
ESnet6 provides an ultra-fast and reliable network that enables new smart instruments for 21st century science. The network capacity has increased dramatically over time, with 2022 bandwidth being 500,000 times greater than in 1993. This network allows rapid data transfer between facilities, such as replicating 7 petabytes of climate data between three labs. It also enables fast assembly and use of new instruments like high energy diffraction microscopy, which can perform an analysis in 31 seconds. The integrated research infrastructure provided by Globus further supports use of remote resources and smart instruments that will drive scientific discovery.
Linking Scientific Instruments and Computation (Ian Foster)
[Talk presented at Monterey Data Conference, August 31, 2022]
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.
A Global Research Data Platform: How Globus Services Enable Scientific Discovery (Ian Foster)
Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs reduce greatly the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and 100s of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
Daniel Lopresti, Bill Gropp, Mark D. Hill, Katie Schuman, and I put together a white paper on "Building a National Discovery Cloud" for the Computing Community Consortium (http://cra.org/ccc). I presented these slides at a Computing Research Association "Best Practices on using the Cloud for Computing Research Workshop" (https://cra.org/industry/events/cloudworkshop/).
Abstract from White Paper:
The nature of computation and its role in our lives have been transformed in the past two decades by three remarkable developments: the emergence of public cloud utilities as a new computing platform; the ability to extract information from enormous quantities of data via machine learning; and the emergence of computational simulation as a research method on par with experimental science. Each development has major implications for how societies function and compete; together, they represent a change in technological foundations of society as profound as the telegraph or electrification. Societies that embrace these changes will lead in the 21st Century; those that do not, will decline in prosperity and influence. Nowhere is this stark choice more evident than in research and education, the two sectors that produce the innovations that power the future and prepare a workforce able to exploit those innovations, respectively. In this article, we introduce these developments and suggest steps that the US government might take to prepare the research and education system for its implications.
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Research Automation for Data-Driven DiscoveryIan Foster
This document discusses research automation and data-driven discovery. It notes that data volumes are growing much faster than computational power, creating a productivity crisis in research. However, most labs have limited resources to handle these large data volumes. The document proposes applying lessons from industry to create cloud-based science services with standardized APIs that can automate and outsource common tasks like data transfer, sharing, publishing, and searching. This would help scientists focus on their core research instead of computational infrastructure. Examples of existing services from Argonne National Lab and the University of Chicago Globus project are provided. The goal is to establish robust, scalable, and persistent cloud platforms to help address the challenges of data-driven scientific discovery.
Scaling collaborative data science with Globus and JupyterIan Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Team Argon proposes a commons platform using reusable components to promote continuous FAIRness of data. These components include Globus Connect Server for standardized data access and transfer across storage systems, Globus Auth for authentication and authorization, and BDBags for exchange of query results and cohorts using a common manifest format. Together these aim to provide uniform, secure, and reliable access, transfer, and sharing of data while supporting identification, search, and virtualization of derived data products.
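As a rough illustration of the BDBag component mentioned above: BDBags build on the BagIt packaging format, so the sketch below uses the generic bagit Python library (not the BDBag tooling itself) to package a hypothetical query-result directory with checksums and minimal metadata; the directory name and metadata fields are invented for the example.

import bagit

# Package a (hypothetical) directory of query results as a BagIt bag;
# BDBags add remote-file manifests and identifiers on top of this format.
bag = bagit.make_bag(
    "cohort_query_results",
    {"Contact-Name": "Example Analyst", "External-Description": "Cohort query results"},
    checksums=["sha256"],
)
print("Bag created; valid:", bag.is_valid())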
This document discusses lessons learned for achieving interoperability. It recommends having a clear purpose, starting with basic conventions like identifiers, monitoring commitments to build trust, and focusing on outward-facing interoperability through simple APIs and platforms rather than full software stacks. Observance of industry practices like authentication methods and cloud-based platforms is also advised to promote rapid development and distribution of applications.
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
Going Smart and Deep on Materials at ALCFIan Foster
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
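As a toy illustration of the modeling range described above (simple linear models through neural networks), the sketch below fits both to synthetic data with scikit-learn; it does not use the MDF/TDDFT stopping-power data, and the feature and target definitions are invented for the example.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 10.0, size=(2000, 3))   # invented descriptors (e.g., velocity plus material features)
y = np.log(X[:, 0] + 1.0) / X[:, 0] + 0.05 * X[:, 1] + rng.normal(0.0, 0.01, size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in (
    LinearRegression(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test R^2 =", round(model.score(X_test, y_test), 3))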
Adtran’s SDG 9000 Series brings high-performance, cloud-managed Wi-Fi 7 to homes, businesses and public spaces. Built on a unified SmartOS platform, the portfolio includes outdoor access points, ceiling-mount APs and a 10G PoE router. Intellifi and Mosaic One simplify deployment, deliver AI-driven insights and unlock powerful new revenue streams for service providers.
Evaluation Challenges in Using Generative AI for Science & Technical ContentPaul Groth
Foundation Models show impressive results in a wide range of tasks on scientific and legal content, from information extraction to question answering and even literature synthesis. However, standard evaluation approaches (e.g., comparing to ground truth) often don't seem to work: qualitatively the results look great, but quantitative scores do not align with these observations. In this talk, I discuss the challenges we've faced in our lab in evaluation. I then outline potential routes forward.
Maxx nft market place new generation nft marketing placeusersalmanrazdelhi
PREFACE OF MAXXNFT
MaxxNFT: Powering the Future of Digital Ownership
MaxxNFT is a cutting-edge Web3 platform designed to revolutionize how digital assets are owned, traded, and valued. Positioned at the forefront of the NFT movement, MaxxNFT views NFTs not just as collectibles, but as the next generation of internet equity: unique, verifiable digital assets that unlock new possibilities for creators, investors, and everyday users alike.
Through strategic integrations with OKT Chain and OKX Web3, MaxxNFT enables seamless cross-chain NFT trading, improved liquidity, and enhanced user accessibility. These collaborations make it easier than ever to participate in the NFT ecosystem while expanding the platform's global reach.
With a focus on innovation, user rewards, and inclusive financial growth, MaxxNFT offers multiple income streams, from referral bonuses to liquidity incentives, creating a vibrant community-driven economy. Whether you're minting your first NFT or building a digital asset portfolio, MaxxNFT empowers you to participate in the future of decentralized value exchange.
https://maxxnft.xyz/
Droidal: AI Agents Revolutionizing HealthcareDroidal LLC
Droidal’s AI Agents are transforming healthcare by bringing intelligence, speed, and efficiency to key areas such as Revenue Cycle Management (RCM), clinical operations, and patient engagement. Built specifically for the needs of U.S. hospitals and clinics, Droidal's solutions are designed to improve outcomes and reduce administrative burden.
Through simple visuals and clear examples, the presentation explains how AI Agents can support medical coding, streamline claims processing, manage denials, ensure compliance, and enhance communication between providers and patients. By integrating seamlessly with existing systems, these agents act as digital coworkers that deliver faster reimbursements, reduce errors, and enable teams to focus more on patient care.
Droidal's AI technology is more than just automation — it's a shift toward intelligent healthcare operations that are scalable, secure, and cost-effective. The presentation also offers insights into future developments in AI-driven healthcare, including how continuous learning and agent autonomy will redefine daily workflows.
Whether you're a healthcare administrator, a tech leader, or a provider looking for smarter solutions, this presentation offers a compelling overview of how Droidal’s AI Agents can help your organization achieve operational excellence and better patient outcomes.
A free demo trial is available for those interested in experiencing Droidal’s AI Agents firsthand. Our team will walk you through a live demo tailored to your specific workflows, helping you understand the immediate value and long-term impact of adopting AI in your healthcare environment.
To request a free trial or learn more:
https://droidal.com/
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...James Anderson
The Quantum Apocalypse: A Looming Threat & The Need for Post-Quantum Encryption
We explore the imminent risks posed by quantum computing to modern encryption standards and the urgent need for post-quantum cryptography (PQC).
Bio: With 30 years in cybersecurity, including as a CISO, Tommy is a strategic leader driving security transformation, risk management, and program maturity. He has led high-performing teams, shaped industry policies, and advised organizations on complex cyber, compliance, and data protection challenges.
Introducing the OSA 3200 SP and OSA 3250 ePRCAdtran
Adtran's latest Oscilloquartz solutions make optical pumping cesium timing more accessible than ever. Discover how the new OSA 3200 SP and OSA 3250 ePRC deliver superior stability, simplified deployment and lower total cost of ownership. Built on a shared platform and engineered for scalable, future-ready networks, these models are ideal for telecom, defense, metrology and more.
Co-Constructing Explanations for AI Systems using ProvenancePaul Groth
Explanation is not a one-off - it's a process where people and systems work together to gain understanding. This idea of co-constructing explanations, or explanation by exploration, is a powerful way to frame the problem of explanation. In this talk, I discuss our first experiments with this approach for explaining complex AI systems by using provenance. Importantly, I discuss the difficulty of evaluation and describe some of our first approaches to evaluating these systems at scale. Finally, I touch on the importance of explanation to the comprehensive evaluation of AI systems.
Securiport is a border security systems provider with a progressive team approach to its task. The company acknowledges the importance of specialized skills in creating the latest in innovative security tech. The company has offices throughout the world to serve clients, and its employees speak more than twenty languages at the Washington D.C. headquarters alone.
New Ways to Reduce Database Costs with ScyllaDBScyllaDB
How ScyllaDB’s latest capabilities can reduce your infrastructure costs
ScyllaDB has been obsessed with price-performance from day 1. Our core database is architected with low-level engineering optimizations that squeeze every ounce of power from the underlying infrastructure. And we just completed a multi-year effort to introduce a set of new capabilities for additional savings.
Join this webinar to learn about these new capabilities: the underlying challenges we wanted to address, the workloads that will benefit most from each, and how to get started. We’ll cover ways to:
- Avoid overprovisioning with “just-in-time” scaling
- Safely operate at up to ~90% storage utilization
- Cut network costs with new compression strategies and file-based streaming
We’ll also highlight a “hidden gem” capability that lets you safely balance multiple workloads in a single cluster. To conclude, we will share the efficiency-focused capabilities on our short-term and long-term roadmaps.
Exploring the advantages of on-premises Dell PowerEdge servers with AMD EPYC processors vs. the cloud for small to medium businesses’ AI workloads
AI initiatives can bring tremendous value to your business, but you need to support your new AI workloads effectively. That means choosing the best possible infrastructure for your needs—and many companies are finding that the cloud isn’t right for them. According to a recent Rackspace survey of IT executives, 69 percent of companies have moved some of their applications on-premises from the cloud, with half of those citing security and compliance as the reason and 44 percent citing cost.
On-premises solutions provide a number of advantages. With full control over your security infrastructure, you can be certain that all compliance requirements remain firmly in the hands of your IT team. Opting for on-premises also gives you the ability to design your infrastructure to the precise needs of that team and your new AI workloads. Depending on the workload, you may also see performance benefits, along with more predictable costs. As you start to build your next AI initiative, consider an on-premises solution utilizing AMD EPYC processor-powered Dell PowerEdge servers.
Jira Administration Training – Day 1 : IntroductionRavi Teja
This presentation covers the basics of Jira for beginners. Learn how Jira works, its key features, project types, issue types, and user roles. Perfect for anyone new to Jira or preparing for Jira Admin roles.
Measuring Microsoft 365 Copilot and Gen AI SuccessNikki Chapple
Session | Measuring Microsoft 365 Copilot and Gen AI Success with Viva Insights and Purview
Presenter | Nikki Chapple 2 x MVP and Principal Cloud Architect at CloudWay
Event | European Collaboration Conference 2025
Format | In person Germany
Date | 28 May 2025
📊 Measuring Copilot and Gen AI Success with Viva Insights and Purview
Presented by Nikki Chapple – Microsoft 365 MVP & Principal Cloud Architect, CloudWay
How do you measure the success—and manage the risks—of Microsoft 365 Copilot and Generative AI (Gen AI)? In this ECS 2025 session, Microsoft MVP and Principal Cloud Architect Nikki Chapple explores how to go beyond basic usage metrics to gain full-spectrum visibility into AI adoption, business impact, user sentiment, and data security.
🎯 Key Topics Covered:
Microsoft 365 Copilot usage and adoption metrics
Viva Insights Copilot Analytics and Dashboard
Microsoft Purview Data Security Posture Management (DSPM) for AI
Measuring AI readiness, impact, and sentiment
Identifying and mitigating risks from third-party Gen AI tools
Shadow IT, oversharing, and compliance risks
Microsoft 365 Admin Center reports and Copilot Readiness
Power BI-based Copilot Business Impact Report (Preview)
📊 Why AI Measurement Matters: Without meaningful measurement, organizations risk operating in the dark—unable to prove ROI, identify friction points, or detect compliance violations. Nikki presents a unified framework combining quantitative metrics, qualitative insights, and risk monitoring to help organizations:
Prove ROI on AI investments
Drive responsible adoption
Protect sensitive data
Ensure compliance and governance
🔍 Tools and Reports Highlighted:
Microsoft 365 Admin Center: Copilot Overview, Usage, Readiness, Agents, Chat, and Adoption Score
Viva Insights Copilot Dashboard: Readiness, Adoption, Impact, Sentiment
Copilot Business Impact Report: Power BI integration for business outcome mapping
Microsoft Purview DSPM for AI: Discover and govern Copilot and third-party Gen AI usage
🔐 Security and Compliance Insights: Learn how to detect unsanctioned Gen AI tools like ChatGPT, Gemini, and Claude, track oversharing, and apply eDLP and Insider Risk Management (IRM) policies. Understand how to use Microsoft Purview—even without E5 Compliance—to monitor Copilot usage and protect sensitive data.
📈 Who Should Watch: This session is ideal for IT leaders, security professionals, compliance officers, and Microsoft 365 admins looking to:
Maximize the value of Microsoft Copilot
Build a secure, measurable AI strategy
Align AI usage with business goals and compliance requirements
🔗 Read the blog https://nikkichapple.com/measuring-copilot-gen-ai/
AI Emotional Actors: “When Machines Learn to Feel and Perform”AkashKumar809858
Welcome to the era of AI Emotional Actors.
The entertainment landscape is undergoing a seismic transformation. What started as motion capture and CGI enhancements has evolved into a full-blown revolution: synthetic beings that not only perform but also express, emote, and adapt in real time.
For reading further follow this link -
https://akash97.gumroad.com/l/meioex
Agentic AI Explained: The Next Frontier of Autonomous Intelligence & Generati...Aaryan Kansari
Agentic AI Explained: The Next Frontier of Autonomous Intelligence & Generative AI
Discover Agentic AI, the revolutionary step beyond reactive generative AI. Learn how these autonomous systems can reason, plan, execute, and adapt to achieve human-defined goals, acting as digital co-workers. Explore its promise, key frameworks like LangChain and AutoGen, and the challenges in designing reliable and safe AI agents for future workflows.
Sticky Note Bullets:
Definition: Next stage beyond ChatGPT-like systems, offering true autonomy.
Core Function: Can "reason, plan, execute and adapt" independently.
Distinction: Proactive (sets own actions for goals) vs. Reactive (responds to prompts).
Promise: Acts as "digital co-workers," handling grunt work like research, drafting, bug fixing.
Industry Outlook: Seen as a game-changer; Deloitte predicts 50% of companies using GenAI will have agentic AI pilots by 2027.
Key Frameworks: LangChain, Microsoft's AutoGen, LangGraph, CrewAI.
Development Focus: Learning to think in workflows and goals, not just model outputs.
Challenges: Ensuring reliability, safety; agents can still hallucinate or go astray.
Best Practices: Start small, iterate, add memory, keep humans in the loop for final decisions.
Use Cases: Limited only by imagination (e.g., drafting business plans, complex simulations).
Nix(OS) for Python Developers - PyCon 25 (Bologna, Italia)Peter Bittner
How do you onboard new colleagues in 2025? How long does it take? Would you love a standardized setup under version control that everyone can customize for themselves? A stable desktop setup, reinstalled in just minutes. It can be done.
This talk was given in Italian, 29 May 2025, at PyCon 25, Bologna, Italy. All slides are provided in English.
Original slides at https://slides.com/bittner/pycon25-nixos-for-python-developers
European Accessibility Act & Integrated Accessibility TestingJulia Undeutsch
Emma Dawson will guide you through two important topics in this session.
Firstly, she will prepare you for the European Accessibility Act (EAA), which comes into effect on 28 June 2025, and show you how development teams can prepare for it.
In the second part of the webinar, Emma Dawson will explore with you various integrated testing methods and tools that will help you improve accessibility during the development cycle, such as Linters, Storybook, Playwright, just to name a few.
Focus: European Accessibility Act, Integrated Testing tools and methods (e.g. Linters, Storybook, Playwright)
Target audience: Everyone, Developers, Testers
1. Towards an Open Analytics Environment. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago.
2. The Computation Institute: a joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods. Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three). www.ci.uchicago.edu. Faculty, fellows, staff, students, computers, projects.
3. The Good Old Days: Astronomy ~1600 (timeline showing stages of 30 years, ? years, 10 years, 6 years, and 2 years).
4. Astronomy, from 1600 to 2000 (1600 value → 2000 value): Automation: 10^-1 → 10^8 Hz data capture; Community: 10^0 → 10^4 astronomers (10^6 amateur); Computation: 10^-1 → 10^15 Hz peak; Data: 10^6 → 10^15 B aggregate; Literature: 10^1 → 10^5 pages/year.
6. Biomedical Research ~2000 (diagram, from John Wooley): DNA (...atcgaattccaggcgtcacattctcaattcca..., sequences, alignments); proteins (MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT..., sequence, 2º/3º/4º structure); protein-protein interactions (metabolism, pathways, receptor-ligand); polymorphism and variants (genetic variants, individual patients, epidemiology); expression patterns and large-scale screens (>10^6 ESTs); genetics and maps (linkage, cytogenetic, clone-based); plus physiology, cellular biology, biochemistry, neurobiology, endocrinology, etc. Item counts shown range from >10^5 to >10^9.
7. Growth of Sequences and Annotations since 1982. Folker Meyer, Genome Sequencing vs. Moore's Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.
8. The Analyst in Denial: “I just need a bigger disk (and workstation).”
9. An Open Analytics Environment: data, programs, and rules go in; results come out. “No limits” on storage, computing, format, or program; allowing for versioning, provenance, collaboration, and annotation.
10. o·pen [oh-puhn] adjective: having the interior immediately accessible; relatively free of obstructions to sight, movement, or internal arrangement; generous, liberal, or bounteous; in operation, live; readily admitting new members; not constipated.
12. What Goes In (2): rules and workflows, expressed in Dryad, MapReduce, parallel programs, SQL, BPEL, Swift, SCFL, R, MatLab, Octave.
13. How it Cooks: Virtualization (run any program, store any data); Indexing (automated maintenance); Provisioning (policy-driven allocation of resources to competing demands).
17. Towards an Open Analysis Environment: (1) Applications: astrophysics, cognitive science, East Asian studies, economics, environmental science, epidemiology, genomic medicine, neuroscience, political science, sociology, solid state physics.
18. Towards an Open Analysis Environment: (2) Hardware: SiCortex (6K cores, 6 Top/s); IBM BG/P (160K cores, 500 Top/s); PADS (10-40 Gbit/s).
19. PADS: Petascale Active Data Store. 500 TB reliable storage (data & metadata); 180 TB at 180 GB/s; 17 Top/s analysis; 1000 TB tape backup. Supports data ingest, dynamic provisioning, parallel analysis, remote access, and offload to remote data centers, serving diverse users and diverse data sources.
20. Towards an Open Analysis Environment: (3) Methods: HPC systems software (MPICH, PVFS, etc.); collaborative data tagging (GLOSS); data integration (XDTM); HPC data analytics and visualization; loosely coupled parallelism (Swift, Hadoop); dynamic provisioning (Falkon); service authoring (Introduce, caGrid, gRAVI); provenance recording and query (Swift); service composition and workflow (Taverna); virtualization management; distributed data management (GridFTP, etc.).
21. Tagging & Social Networking. GLOSS: Generalized Labels Over Scientific data Sources.
22. XDTM: XML Data Typing & Mapping (mapping a logical structure onto a physical directory layout). Example physical layout:
./group23:
drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC
./group23/AA:
drwxr-xr-x 5 yongzh users 2048 Nov 5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users 2048 Dec 6 12:24 11nov06aa
./group23/AA/04nov06aa:
drwxr-xr-x 2 yongzh users 2048 Nov 5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users 49152 Dec 5 11:40 FUNCTIONAL
./group23/AA/04nov06aa/ANATOMY:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users 16777216 Nov 5 12:29 coplanar.img
./group23/AA/04nov06aa/FUNCTIONAL:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users 496 Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0003.img
23. fMRI Type Definitions:
type Study { Group g[ ]; }
type Group { Subject s[ ]; }
type Subject { Volume anat; Run run[ ]; }
type Run { Volume v[ ]; }
type Volume { Image img; Header hdr; }
type Image {};
type Header {};
type Warp {};
type Air {};
type AirVec { Air a[ ]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
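Purely as an illustration (not part of the original slides), the sketch below mirrors these SwiftScript types as Python dataclasses and shows how the physical layout from the previous slide could be mapped onto the logical Subject/Run/Volume structure; the pairing logic is a simplification, not the XDTM implementation.

from dataclasses import dataclass, field
from pathlib import Path
from typing import List

@dataclass
class Volume:
    img: Path   # .img data file
    hdr: Path   # .hdr header file

@dataclass
class Run:
    v: List[Volume] = field(default_factory=list)

@dataclass
class Subject:
    anat: Volume
    run: List[Run] = field(default_factory=list)

def load_subject(subject_dir: Path) -> Subject:
    # Pair each *.hdr with its *.img to build Volumes, roughly as an XDTM mapper would.
    def volumes(d: Path) -> List[Volume]:
        return [Volume(img=h.with_suffix(".img"), hdr=h) for h in sorted(d.glob("*.hdr"))]
    anat = volumes(subject_dir / "ANATOMY")[0]
    functional = Run(v=volumes(subject_dir / "FUNCTIONAL"))
    return Subject(anat=anat, run=[functional])

# e.g. load_subject(Path("group23/AA/04nov06aa"))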
27. Multi-level Scheduling (architecture diagram): a SwiftScript specification is compiled into an abstract computation recorded in a virtual data catalog; the execution engine (Karajan with the Swift runtime) schedules application tasks (e.g., App F1 and App F2 operating on file1, file2, file3), records provenance via a provenance collector, and reports status through Swift runtime callouts; the Falkon resource provisioner acquires virtual nodes and worker nodes (e.g., on Amazon EC2), on which launchers run the applications.
28. DOCK on SiCortex: CPU cores: 5760; power: 15,000 W; tasks: 92,160; elapsed time: 12,821 sec; compute time: 1.94 CPU years (does not include ~800 sec to stage input data). Ioan Raicu, Zhao Zhang.
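A quick arithmetic check of these figures (using only the numbers quoted on the slide) gives roughly 2.3 CPU-years of available capacity, about 83% utilization, and a mean task time of roughly 660 seconds:

cores, tasks, elapsed_s = 5760, 92160, 12821
cpu_year_s = 365.25 * 24 * 3600                 # seconds in one CPU-year
capacity = cores * elapsed_s / cpu_year_s       # ~2.34 CPU-years available during the run
compute = 1.94                                  # CPU-years of compute, as quoted on the slide
print(f"capacity ~{capacity:.2f} CPU-years, utilization ~{compute / capacity:.0%}")
print(f"mean time per task ~{compute * cpu_year_s / tasks:.0f} s")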
29. LIGO Gravitational Wave Observatory (sites shown include Birmingham, Cardiff, AEI/Golm): >1 terabyte/day to 8 sites; 770 TB replicated to date (>120 million replicas); MTBF = 1 month. Ann Chervenak et al., ISI; Scott Koranda et al., LIGO.
30. Lag Plot for Data Transfers to Caltech Credit: Kevin Flasch, LIGO
32. Social Informatics Data Grid (SIDgrid): collaborative, multi-modal analysis of cognitive science data. Diverse experimental data & metadata flow into SIDgrid, backed by TeraGrid and PADS; users can browse data, search, preview content, transcode, download, and analyze.
35. A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH): dynamics, foresight, uncertainty, resolution, …; agriculture, transport, taxation, …; data (global, local, …); (super)computers; the CIM-EARTH framework; community process; open code and data.
36. Alleviating Poverty in Thailand: Modeling Entrepreneurship. Compare a model that considers only wealth and access to capital with one that also considers distance to 6 major cities (maps show model match from high to low). Rob Townsend, Victor Zhorin, et al.
41. An Open Analytics Environment: data, programs, and rules go in; results come out. “No limits” on storage, computing, format, or program; allowing for versioning, provenance, collaboration, and annotation.