Maintainable Cloud
Architecture of Hadoop
Kai Sasaki
Treasure Data Inc.
Who am I?
• Kai Sasaki (佐々木 海)
• @Lewuathe on Twitter and GitHub
• Software Engineer at Treasure Data Inc.
• Contributing to Hadoop and Spark
Hadoop in Treasure Data
Cloud-based Data warehousing service
Hadoop is the core of Treasure Data
Hadoop on Cloud
1. Features provided by AWS, IDCF, Heroku, etc.
2. Fast-growing reliability and integrity
Maintainability of Middleware
Agenda
• Maintainability of Distributed System
• Our Challenges
• Stateless Hive Metastore
• Cloud Storage for Hadoop
• Multiple Hadoop Version Management
• Regression Test for Hive Queries
• REST API for Hadoop
• Workflow Integration
• What we should keep in mind
Maintainability
We think high maintainability is achieved by…
• Stateless
• Mobility
• Queueing
Stateless
• Stateless Hive metastore
• Cloud Storage for Hadoop
Stateless Hive MS
Stateful Hive MS
• Driver and Metastore with MySQL
• Requires maintaining an RDBMS only for the Metastore
Stateless Hive MS
• Driver and Metastore with Derby
• Worker submits DDL requests
• Stateful points are aggregated in the Treasure Data API
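One way to picture the stateless metastore is a worker that builds a throwaway embedded Derby metastore for each job and recreates the tables it needs from schemas held elsewhere. The sketch below is only an illustration under that assumption: the schema dictionary and the function names are hypothetical, and only the two `javax.jdo.option.*` property names and the Derby embedded driver class come from Hive itself.

```python
import tempfile
from pathlib import Path


def derby_metastore_properties(workdir):
    """Hive properties pointing the metastore at a private embedded Derby DB."""
    db_path = Path(workdir) / "metastore_db"
    return {
        "javax.jdo.option.ConnectionURL":
            f"jdbc:derby:;databaseName={db_path};create=true",
        "javax.jdo.option.ConnectionDriverName":
            "org.apache.derby.jdbc.EmbeddedDriver",
    }


def create_table_ddl(table, columns):
    """Render one CREATE TABLE statement from (name, hive_type) pairs."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return f"CREATE TABLE IF NOT EXISTS `{table}` ({cols})"


def prepare_job(schemas):
    """Return (hive properties, DDL statements) for a single job.

    `schemas` maps table name -> list of (column, type). In this sketch it
    stands in for whatever the central, stateful service would return.
    """
    workdir = tempfile.mkdtemp(prefix="hive-ms-")  # discarded after the job
    ddl = [create_table_ddl(t, cols) for t, cols in sorted(schemas.items())]
    return derby_metastore_properties(workdir), ddl
```

Because the Derby database lives in a temporary directory, losing a worker loses nothing that cannot be rebuilt from the schemas.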
Cloud Storage for Hadoop
PlazmaDB
• Data Connector: S3, Redshift, MySQL, PostgreSQL, Salesforce and more
• SDK: iOS, Android, JavaScript, Unity
• Bulk Import, td client, ...
• Diagram labels: msgpack, Hadoop, Stateful
PlazmaDB
• PostgreSQL on Amazon RDS: Transaction
• S3 or Riak, holding msgpack: Immutable
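The split between a transactional database and immutable objects can be sketched in a few lines. This is not PlazmaDB code: `sqlite3` stands in for PostgreSQL, an in-memory dictionary stands in for S3 or Riak, JSON stands in for msgpack, and every name below is hypothetical. The point is only the ordering: write the object once under a fresh key, then publish it by committing a metadata row.

```python
import json
import sqlite3
import uuid


class ObjectStore:
    """Write-once object store (stand-in for S3 or Riak)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        if key in self._objects:
            raise KeyError(f"object {key!r} already exists and is immutable")
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def open_metadata():
    """Metadata database (stand-in for PostgreSQL)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE partitions (table_name TEXT, object_key TEXT PRIMARY KEY)"
    )
    return db


def append_rows(db, store, table, rows):
    """Store rows as a new immutable object, then commit its metadata."""
    key = f"{table}/{uuid.uuid4().hex}"
    store.put(key, json.dumps(rows).encode("utf-8"))
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO partitions VALUES (?, ?)", (table, key))
    return key


def scan(db, store, table):
    """Read every committed partition of a table, in insertion order."""
    keys = db.execute(
        "SELECT object_key FROM partitions WHERE table_name = ? ORDER BY rowid",
        (table,),
    ).fetchall()
    rows = []
    for (key,) in keys:
        rows.extend(json.loads(store.get(key)))
    return rows
```

An object whose metadata row was never committed is invisible to `scan`, so a failed write leaves readers unaffected.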
Mobility
• Multiple Hadoop Version Management
• Regression Test for Hive Queries
Multiple Hadoop Version Management
• Clusters: CDH, HDP, Apache
• One client per cluster: tough operation
• One Worker switching between CDH, HDP and Apache
• CDH, HDP and Apache packages are kept on S3 under /test, /stable, ...
• Worker downloads the packages from S3
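The package switching above might look roughly like the following on the worker side. The key layout under `test/` and `stable/`, the archive naming, and the cache directory are assumptions made for illustration; only the `HADOOP_HOME` and `HADOOP_CONF_DIR` environment variables are standard Hadoop names.

```python
from dataclasses import dataclass

DISTRIBUTIONS = ("cdh", "hdp", "apache")


@dataclass(frozen=True)
class HadoopPackage:
    channel: str       # "test" or "stable"
    distribution: str  # one of DISTRIBUTIONS
    version: str

    def s3_key(self):
        # Hypothetical layout: <channel>/<distribution>/hadoop-<version>.tar.gz
        return f"{self.channel}/{self.distribution}/hadoop-{self.version}.tar.gz"


def package_for(distribution, version, use_test=False):
    """Pick the package matching the cluster a job will run on."""
    if distribution not in DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {distribution}")
    channel = "test" if use_test else "stable"
    return HadoopPackage(channel, distribution, version)


def worker_environment(package, cache_dir="/opt/hadoop-packages"):
    """Environment for a client process using the downloaded package."""
    home = f"{cache_dir}/{package.channel}/{package.distribution}-{package.version}"
    return {"HADOOP_HOME": home, "HADOOP_CONF_DIR": f"{home}/etc/hadoop"}
```

Keeping the choice in data rather than in per-cluster client hosts lets one worker serve any cluster and makes promotion from /test to /stable a matter of copying a package.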
Regression Test for Hive
• New features, version upgrades, and migrations must be introduced without regression
• Running integration system tests and regression tests for Hive queries
[Diagram: CircleCI, System Test Repository, Hadoop Repository, S3 (Apache package), Worker, CDH / HDP / Apache]
CircleCI logo: http://blog.circleci.com/meet-our-new-logo/
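A regression test for Hive queries can be reduced to running the same query on the current and the candidate version and comparing results. The sketch below assumes two hypothetical callables, `run_baseline` and `run_candidate`, that each take a SQL string and return rows; how they reach a cluster is left out. Rows are compared as multisets because Hive does not guarantee row order without ORDER BY.

```python
from collections import Counter


def normalize(rows):
    """Order-insensitive view of a result set that still counts duplicates."""
    return Counter(tuple(row) for row in rows)


def find_regressions(queries, run_baseline, run_candidate):
    """Return the names of queries whose results differ between versions.

    `queries` maps a test name to a SQL string.
    """
    regressions = []
    for name, sql in queries.items():
        if normalize(run_baseline(sql)) != normalize(run_candidate(sql)):
            regressions.append(name)
    return sorted(regressions)
```

A CI job could fail the build whenever `find_regressions` returns a non-empty list.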
Queueing
• REST API for Hadoop
• RDS-based queue management system
REST API for Hadoop
[Diagram: Worker, PerfectQueue, Hadoop Job Server (REST API), Presto (API), CDH / HDP / Apache]
RDBMS-based Queue Management System
[Diagram: Client x3, PerfectQueue, Worker, Hadoop Job Server, CDH / HDP / Apache]
PerfectQueue
• Highly available distributed queue built on RDBMS
• Amazon SQS-like API
• Resource scheduling for multi-tenancy
• Graceful and live restarting
https://github.com/treasure-data/perfectqueue
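The idea of a queue built on an RDBMS with an SQS-like lease can be sketched as below. This is not PerfectQueue's API or schema: `sqlite3` stands in for the RDBMS, and the table layout and function names are hypothetical. A task is never removed when it is taken; it is only hidden until its lease expires, so a crashed worker's task reappears on its own.

```python
import sqlite3


def open_queue():
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # transaction in acquire() is controlled explicitly.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute(
        "CREATE TABLE tasks ("
        "id TEXT PRIMARY KEY, payload TEXT NOT NULL, visible_at REAL NOT NULL)"
    )
    return db


def submit(db, task_id, payload, now):
    """Persist a request; the primary key rejects duplicate submissions."""
    db.execute("INSERT INTO tasks VALUES (?, ?, ?)", (task_id, payload, now))


def acquire(db, now, lease_seconds=300.0):
    """Take the oldest visible task and hide it for lease_seconds."""
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute(
            "SELECT id, payload FROM tasks WHERE visible_at <= ? "
            "ORDER BY visible_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is not None:
            db.execute(
                "UPDATE tasks SET visible_at = ? WHERE id = ?",
                (now + lease_seconds, row[0]),
            )
        db.execute("COMMIT")
        return row
    except Exception:
        db.execute("ROLLBACK")
        raise


def finish(db, task_id):
    """Delete a task once the worker has completed it."""
    db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
```

Timestamps are passed in rather than read from the clock so the lease behaviour is easy to test.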
What we should keep in mind
• Stateless: delegate responsibility to cloud systems
• Mobility: look ahead to version upgrades and migration
• Queueing: make each request persistent
Recap
• Maintainability of Distributed System
• Our Challenges
• Stateless Hive Metastore
• Cloud Storage for Hadoop
• Multiple Hadoop version management
• Regression Test for Hive queries
• REST API for Hadoop
• Workflow Integration
• What we should keep in mind
https://www.treasuredata.com/
