High Performance Computing (HPC) in cloud
Overview
1. Introduction to HPC
2. Hadoop (HDFS, MapReduce)
3. AWS toolkit (Amazon S3, Amazon EMR, Amazon Redshift)
4. Case study
Why?
Bottlenecks in Genome Analysis
Large data files from sequencers.
Computational bottleneck.
Processing time.
Data persistence and reliability.
Data security.
How?
Introduction
“High Performance Computing (HPC) most generally refers to the practice of
aggregating computing power in a way that delivers much higher performance
than one could get out of a typical desktop computer or workstation in order to
solve large problems in science, engineering, or business.”
Forms of HPC
Dedicated supercomputer.
Commodity HPC cluster.
Grid computing.
HPC in cloud.
What?
Hadoop
Open-source, Java-based framework for reliable, scalable, distributed
computing.
Created by Doug Cutting and Mike Cafarella; developed at Yahoo! from 2006 to
2008, inspired by the Google File System (GFS) paper published in 2003.
Key Components
Hadoop Distributed File System (HDFS)
MapReduce
Hadoop
HDFS (Hadoop Distributed File System)
Data management layer
Master-slave architecture (NameNode as master, DataNodes as workers)
Fault tolerant (data blocks are replicated across nodes)
Key Components:
NameNode
SecondaryNameNode
DataNode
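The storage cost of HDFS fault tolerance can be estimated with a little arithmetic. The sketch below is illustrative only: the function name `hdfs_footprint` is hypothetical, and the 128 MB block size and replication factor of 3 are assumed Hadoop defaults that do not appear in the slides.

```shell
# Sketch: estimate how HDFS would split and replicate a single file.
# Assumptions (not from the slides): 128 MB blocks, replication factor 3.
hdfs_footprint() {
  local file_mb=$1          # file size in MB
  local block_mb=${2:-128}  # block size in MB (assumed default)
  local replicas=${3:-3}    # replication factor (assumed default)

  # Ceiling division: a partial final block still counts as one block.
  local blocks=$(( (file_mb + block_mb - 1) / block_mb ))
  # Every byte is stored once per replica; the last block is not padded.
  local raw_mb=$(( file_mb * replicas ))

  echo "blocks=${blocks} raw_mb=${raw_mb}"
}

# A 1000 MB sequencer file under the assumed defaults.
hdfs_footprint 1000
```

Under these assumptions, losing a single node leaves two other copies of each affected block, at the price of roughly three times the raw disk space.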
Hadoop - Continued
MapReduce
Mappers and Reducers
Batch-oriented
Key Components (Hadoop 1.x; YARN takes over these roles in Hadoop 2)
JobTracker
TaskTracker
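The mapper and reducer roles can be imitated on one machine with a shell pipeline. This is only a local analogy of the programming model, not Hadoop itself: the function names `map_words`, `shuffle`, and `reduce_counts` are hypothetical, and `sort` stands in for the shuffle phase that groups identical keys before reduction.

```shell
# Sketch: word count expressed as map -> shuffle -> reduce.

# Map: emit one "word<TAB>1" pair per input word (lowercased).
map_words() {
  tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | awk 'NF { print $0 "\t1" }'
}

# Shuffle: sorting brings all pairs with the same key next to each other.
shuffle() {
  LC_ALL=C sort
}

# Reduce: sum the counts of each run of identical keys.
reduce_counts() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
    { sum += $2 }
    END { if (NR > 0) print key "\t" sum }
  '
}

printf 'reads map reads\nreduce reads\n' | map_words | shuffle | reduce_counts
```

Because the reducer only compares each key with the previous one, it depends on the sorted order produced by the shuffle step, which mirrors why a framework must group keys between the two phases.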
Hadoop - Architecture
AWS ToolKit - Amazon Elastic MapReduce (EMR)
Managed Hadoop framework.
Runs popular distributed frameworks such as Apache Spark, HBase,
Presto, and Flink.
Elastic.
Flexible data storage (S3, HDFS, Redshift, Glacier, RDS).
Secure and reliable.
Full control and root access.
AWS ToolKit - Amazon EMR
# Launch a two-node cluster with Hive and Spark installed.
aws emr create-cluster \
  --name "demo" \
  --release-label emr-4.5.0 \
  --instance-type m3.xlarge \
  --instance-count 2 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles \
  --applications Name=Hive Name=Spark
# Launch a cluster (legacy AMI-version syntax) and run a Pig script as a step.
# The --steps value must be a single argument with no embedded spaces or line breaks.
aws emr create-cluster \
  --name "Test cluster" \
  --ami-version 2.4 \
  --applications Name=Hive Name=Pig \
  --use-default-roles --ec2-attributes KeyName=myKey \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/pigscript.pig,-p,INPUT=s3://mybucket/inputdata/,-p,OUTPUT=s3://mybucket/outputdata/,$INPUT=s3://mybucket/inputdata/,$OUTPUT=s3://mybucket/outputdata/]
AWS ToolKit - Amazon S3 (Simple Storage Service)
Virtually unlimited storage.
Single object size up to 5 TB.
Why use S3?
Durable, low cost, scalable, high performance, secure, integrated, easy to use.
Decouples storage from compute resources.
Meets HDFS requirements: EMR implements the Hadoop file system interface on top of S3 as EMRFS.
AWS ToolKit - Amazon Redshift
Fast, simple petabyte-scale data warehouse.
Queried with SQL.
Massively parallel.
Relational.
Architecture: leader node and compute nodes.
Fast: 4 GB/sec/node.
Case Study - Rail-RNA
Cloud-enabled spliced aligner that analyzes many samples at once.
Architecture: Amazon S3, Amazon EMR.
~50,000 human RNA samples (from the NCBI archive) analyzed with Rail-RNA: 150 Tbp.
Input to result: 2 weeks.
Cost: ~$1.40/sample.
Paper: Splicing across SRA.
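The case-study figures can be turned into rough totals. The sketch below uses only the numbers quoted on the slide (about 50,000 samples, about $1.40 per sample, 2 weeks); the function names `estimate_cost` and `samples_per_day` are hypothetical, and the results are back-of-the-envelope estimates rather than figures reported by the paper.

```shell
# Sketch: rough totals derived from the slide's approximate figures.

# Total cost = number of samples x cost per sample.
estimate_cost() {
  local samples=$1 cost_per_sample=$2
  awk -v n="$samples" -v c="$cost_per_sample" 'BEGIN { printf "%.2f\n", n * c }'
}

# Average throughput = number of samples / elapsed days (truncated to a whole number).
samples_per_day() {
  local samples=$1 days=$2
  awk -v n="$samples" -v d="$days" 'BEGIN { printf "%d\n", n / d }'
}

estimate_cost 50000 1.40     # approximate total in USD
samples_per_day 50000 14     # approximate samples processed per day
```

With the slide's numbers this works out to roughly $70,000 in total and about 3,500 samples per day.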
Thank you
