using R and High Performance Computers
an overview by Dave Hiltbrand
talking points
● why HPC?
● R environment tips
● staging R scripts for HPC
● purrr::map functions
what to do if the computation is too big
for your desktop/laptop?
• a common user question:
– i have an existing R pipeline for my research work, but the data is
growing too big. now my R program runs for days (or weeks) to finish, or
simply runs out of memory.
• 3 Strategies
– move to bigger hardware
– advanced libraries/C++
– implement code using parallel packages
trends in HPC
➔ processors not getting faster
➔ increase performance => cram
more cores on each chip
➔ requires reducing clock speed
(power + heat)
➔ single-threaded applications
will run SLOWER on these new
resources, must start thinking in
parallel
https://www.quora.com/Why-havent-CPU-clock-speeds-increased-in-the-last-5-years
strategy 1: powerful hardware
Stampede2 - HPC
● KNL - 68 cores (4x hyperthreading 272)/ 96GB mem/ 4200 nodes
● SKX - 48 cores (2x hyperthreading 96)/ 192 GB mem/ 1736 nodes
Maverick - Vis
● vis queue: 20 cores/ 256 GB mem/ 132 nodes
○ RStudio/ Jupyter Notebooks
● gpu queue: 132 NVIDIA Tesla K40 GPUs
Wrangler - Data
● Hadoop/Spark
● reservations last up to a month
allocations
open to the national researcher community
do you work in industry?
XSEDE
● national organization allocating computational
resources, including ~90% of the cycles on Stampede2
tip
if you need more power
all you have to do is ask
https://portal.xsede.org
/allocations/resource-
info
HPCs are:
➔ typically run Linux
➔ more command line
driven
➔ daunting to Windows-only
users
➔ RStudio helps the
transition
login nodes
➔ always log into the login nodes
➔ shared nodes with limited
resources
➔ ok to edit, compile, move files
➔ for R, ok to install packages
from login nodes
➔ don’t run R scripts here!
compute nodes
➔ dedicated nodes for each job
➔ only accessible via a job
scheduler
➔ once you have a job running on
a node you can ssh into the
node
access
R command line
● useful to install packages on login nodes
● using interactive development jobs you can request compute resources,
log in straight to a compute node, and use R via the command line
RStudio
● availability depends on the structure of the HPC cluster
● at TACC, RStudio sessions through the visualization portal are limited
to 4 hours
batch Jobs
● best method to use R on HPCs
● relies on a job scheduler to fill your request
● can run multiple R scripts on multiple compute nodes
sample batch script
#!/bin/bash
#----------------------------------------------------
#
#----------------------------------------------------
#SBATCH -J myjob # Job name
#SBATCH -o myjob.o%j # Name of stdout output file
#SBATCH -e myjob.e%j # Name of stderr error file
#SBATCH -p skx-normal # Queue (partition) name
#SBATCH -N 1               # Total # of nodes (must be 1 for serial)
#SBATCH -n 1               # Total # of mpi tasks (should be 1 for serial)
#SBATCH -t 01:30:00        # Run time (hh:mm:ss)
#SBATCH --mail-user=myname@myschool.edu
#SBATCH --mail-type=all    # Send email at begin and end of job
#SBATCH -A myproject       # Allocation name (req'd if you have more than 1)
# Other commands must follow all #SBATCH directives...
module list
pwd
date
# Launch serial code...
Rscript ./my_analysis.R > output.Rout 2> error.Rerr
# ---------------------------------------------------
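assuming the script above is saved as myjob.slurm (an illustrative filename), a typical Slurm session on a login node might look like this sketch; these are standard Slurm commands, but queue names and job ids are placeholders:

```shell
# submit the batch script to the scheduler
sbatch myjob.slurm

# check the state of your queued and running jobs
squeue -u $USER

# cancel a job by its id if something went wrong (id is illustrative)
scancel 123456
```

the scheduler emails you (per the --mail-type directive) when the job starts and ends, and stdout/stderr land in the files named by -o and -e.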
.libPaths() and .Rprofile
using your Rprofile.site or .Rprofile files along with
the .libPaths() command will allow you to install
packages in your user folder and have them load up
when you start R on the HPC.
in R, a library is the location on disk where you install your packages. R
creates a different library for each dot-version of R itself.
when R starts, it performs a series of steps to initialize the session. you can
modify the startup sequence by changing the contents in a number of
locations.
the following sequence is somewhat simplified:
● first, R reads the file Rprofile.site in the R_HOME/etc folder,
where R_HOME is the location where you installed R.
○ for example, on Windows this file could live at
C:\R\R-3.2.2\etc\Rprofile.site.
○ making changes to this file affects all R sessions
that use this version of R.
○ this might be a good place to define your preferred
CRAN mirror, for example.
● next, R reads the file ~/.Rprofile in the user's home folder.
● lastly, R reads the file .Rprofile in the project folder
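as a sketch, a minimal ~/.Rprofile on an HPC system might prepend a user-writable library and set a default mirror — the library path below is a hypothetical example, not a TACC-specific location:

```r
# prepend a user-writable library so install.packages() works without root
.libPaths( c( "~/R/library/3.4", .libPaths() ) )

# set a default CRAN mirror so installs don't prompt interactively
options( repos = c( CRAN = "https://cran.r-project.org" ) )
```

with this in place, packages you install from a login node land in your home folder and load automatically in batch jobs.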
tip
i like to make a .Rprofile
for each GitHub project
repo which loads my
most commonly used
libraries by default.
going parallel
often you need to convert your code
into parallel form to get the most out
of HPC. the foreach and doMC
packages will let you convert loops
from sequential operation to parallel.
you can even use multiple nodes if you
have a really complex data set with the
snow package.
require( foreach )
require( doMC )
registerDoMC( cores = 4 )  # register the backend, or %dopar% runs sequentially
result <- foreach( i = 1:10, .combine = c ) %dopar% {
myProc()
}
require( foreach )
require( doSNOW )
# Get backend hostnames
hostnames <- scan( "nodelist.txt", what = "", sep = "\n" )
# Set reps to match core count
num.cores <- 4
hostnames <- rep( hostnames, each = num.cores )
cluster <- makeSOCKcluster( hostnames )
registerDoSNOW( cluster )
result <- foreach( i = 1:10, .combine=c ) %dopar% {
myProc()
}
stopCluster( cluster )
profiling
➔ simple procedure checks with
tictoc package
➔ use more advanced packages
like microbenchmark for
multiple procedures
➔ for an easy-to-read graphical
output, use the profvis package
to create flamegraphs
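a minimal sketch of the two timing approaches above, assuming the tictoc and microbenchmark packages are installed; myProc is a stand-in workload, not part of the deck:

```r
library( tictoc )
library( microbenchmark )

# stand-in workload for illustration
myProc <- function( n = 1e5 ) sum( sqrt( 1:n ) )

# quick wall-clock check of a single procedure
tic( "myProc" )
myProc()
toc()

# compare multiple procedures over repeated runs
microbenchmark(
  vectorized = myProc(),
  with_loop  = { s <- 0; for ( i in 1:1e5 ) s <- s + sqrt( i ); s },
  times = 10
)
```

microbenchmark reports min/median/max across the runs, which is far more reliable than a single timing when deciding what to parallelize.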
checkpointing
➔ when writing your script think of
procedure runtime
➔ you can save objects in your
workflow as a checkpoint
◆ library(readr)
◆ write_rds(obj, "obj.rds")
➔ if you want to run post hoc
analysis it makes it easier to
have all the parts
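a checkpoint sketch with readr: skip an expensive step when its saved result already exists (the file name and the lm() stand-in are illustrative):

```r
library( readr )

if ( file.exists( "fit.rds" ) ) {
  fit <- read_rds( "fit.rds" )          # resume from the checkpoint
} else {
  fit <- lm( mpg ~ wt, data = mtcars )  # an expensive step stands in here
  write_rds( fit, "fit.rds" )           # save checkpoint for post hoc analysis
}
```

if a long job dies partway through, rerunning the script picks up from the last saved object instead of starting over.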
always start small
i’m quick:
● build a toy dataset
● find your typos
● easier to rerun
i’m slow:
● run the real data
● request the right resources
once you run a small
dataset you can benchmark
the resources needed
if you don’t already you need to Git
Git is a command-line tool,
but the center around
which all things involving
Git revolve is the hub—
GitHub.com—where
developers store their
projects and network with
like-minded people.
use RStudio and all the
advanced IDE tools on
your local machine then
push and pull to GitHub to
run your job. RStudio
features built-in VCS support
track changes in your
analysis; Git lets you go
back in time to a previous
version of your file
Purrr Package
Map functions apply a function iteratively to each
element of a list or vector
the purrr map functions are an optional replacement for the
lapply functions. they are not technically faster (although
the speed comparison is in nanoseconds).
the main advantage is a uniform syntax shared with other
tidyverse packages such as dplyr, tidyr, readr, and stringr,
as well as the helper functions.
map( .x, .f, … )
map( vector_or_list_input, function_to_apply,
optional_other_stuff )
modify( .x, .f, … )
ex. modify_at( my.data, 1:5, as.numeric )
https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf
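for example, the lapply call and its purrr equivalent side by side — the small list here is only for illustration:

```r
library( purrr )

x <- list( a = 1:3, b = 4:6 )

# base R: returns a list
lapply( x, mean )

# purrr: same result, tidyverse-consistent syntax
map( x, mean )

# typed variant returns a numeric vector instead of a list
map_dbl( x, mean )   # a = 2, b = 5
```

the typed variants (map_dbl, map_chr, map_lgl, …) also fail loudly if the function returns the wrong type, which catches bugs early in long pipelines.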
map in parallel
another key advantage of purrr is its support for lambda
functions, which is crucial for analysis involving
multiple columns of a data frame. using the same
basic syntax we can create an anonymous function that
maps over many lists simultaneously
my.data %<>% mutate( var5 = map2_dbl( .$var3, .$var4,
~ ( .x + .y ) / 2 ))
my.data %<>% mutate( var6 = pmap_dbl( list( .$var3,
.$var4, .$var5), ~ (..1 + ..2 + ..3) / 3 ))
tip
combining the grammars of
graphics, data, and lists
through tidyverse
packages builds a
strong workflow
closing
unburden your personal device
➔ learn basic linux cli
using batch job submissions gives you
the most flexibility
➔ profile/checkpoint/test
resources are not without limits
➔ share your code
don’t hold onto code until it’s perfect.
use GitHub and get feedback early and
often
$ questions -h
refs:
1. https://jennybc.github.io/purrr-tutorial/
2. https://portal.tacc.utexas.edu/user-guides/stampede2#running-jobs-on-the-stampede2-compute-nodes
3. https://learn.tacc.utexas.edu/mod/page/view.php?id=24
4. http://blog.revolutionanalytics.com/2015/11/r-projects.html