Maintainable Cloud
Architecture of Hadoop
Kai Sasaki
Treasure Data Inc.

Who am I?
• Kai Sasaki (佐々木海)
• @Lewuathe on Twitter and GitHub
• Software Engineer at Treasure Data Inc.
• Contributor to Hadoop and Spark

Cloud-based Data warehousing service

Hadoop is the core of Treasure Data

Hadoop on Cloud
1. Features provided by AWS, IDCF, Heroku, etc.
2. Fast-growing reliability and integrity
Maintainability of Middleware

Agenda
• Maintainability of Distributed System
• Our Challenges
• Stateless Hive Metastore
• Cloud Storage for Hadoop
• Multiple Hadoop Version Management
• Regression Test for Hive Queries
• REST API for Hadoop
• Workflow Integration
• What we should keep in mind

Maintainability
We think high maintainability is achieved by…
• Stateless
• Mobility
• Queueing

Stateless
• Stateless Hive metastore

Stateful Hive MS
Driver → Metastore → MySQL
Requires maintaining an RDBMS solely for the metastore

Stateless Hive MS
Driver → Metastore → Derby
Worker: submit DDL request
Aggregate stateful points: Treasure Data API
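
One way to read the slides above: each worker starts a throwaway metastore on embedded Derby and rebuilds the table definitions it needs by submitting DDL, while the durable schema lives behind the Treasure Data API. A minimal Python sketch of that idea follows. The `javax.jdo.option.*` keys are standard Hive metastore properties; the schema shape, the function names, the table name, and the scratch directory are hypothetical and not taken from the talk.

```python
import os
import tempfile


def derby_metastore_conf(scratch_dir):
    """Hive properties for a throwaway metastore backed by embedded Derby."""
    db_path = os.path.join(scratch_dir, "metastore_db")
    return {
        "javax.jdo.option.ConnectionURL":
            "jdbc:derby:;databaseName=%s;create=true" % db_path,
        "javax.jdo.option.ConnectionDriverName":
            "org.apache.derby.jdbc.EmbeddedDriver",
    }


def ddl_from_schema(table, columns):
    """Build the DDL a worker would submit to recreate one table definition.

    `columns` is a list of (name, hive_type) pairs. In this sketch it stands
    in for a schema fetched from an external API (assumed shape).
    """
    cols = ", ".join("`%s` %s" % (name, hive_type) for name, hive_type in columns)
    return "CREATE TABLE IF NOT EXISTS `%s` (%s)" % (table, cols)


# Nothing here needs to survive the job: the scratch directory can be deleted
# afterwards because the schema can always be rebuilt from its source.
scratch = tempfile.mkdtemp(prefix="hive-ms-")
conf = derby_metastore_conf(scratch)
ddl = ddl_from_schema("access_logs", [("time", "BIGINT"), ("path", "STRING")])
```

Because the metastore holds no state that cannot be regenerated, there is no dedicated database to back up, upgrade, or fail over.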

PlazmaDB
Data Connector: S3, Redshift, MySQL, PostgreSQL, Salesforce and more
SDK: iOS, Android, JavaScript, Unity
Bulk Import td client
...
msgpack
Hadoop
Stateful

PlazmaDB
PostgreSQL (Amazon RDS): Transaction
S3 or Riak (msgpack): Immutable

Mobility
• Multiple Hadoop Version Management
• Regression Test for Hive Queries

Multiple Hadoop Version Management

Multiple Version Management
CDH, HDP, Apache
client, client, client
Tough Operation

Multiple Version Management
CDH, HDP, Apache
Worker (switching)
CDH package, HDP package, Apache package
S3

Multiple Version Management
S3
/test
/stable
...
CDH package, HDP package, Apache package
Worker (download)
CDH, HDP, Apache
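
The layout above, with packages for each distribution kept under `/test` and `/stable` prefixes on S3 and downloaded by the worker, can be sketched as a small key-resolution helper. The slides show only the two prefixes; the directory layout below them, the file naming, the function names, and the idea of promoting a package from test to stable are assumptions made for illustration.

```python
CHANNELS = ("test", "stable")
DISTRIBUTIONS = ("cdh", "hdp", "apache")


def package_key(channel, distribution, version):
    """Return the object key a worker would download for one Hadoop package.

    Assumed layout: <channel>/<distribution>/hadoop-<version>.tar.gz
    """
    if channel not in CHANNELS:
        raise ValueError("unknown channel: %r" % channel)
    if distribution not in DISTRIBUTIONS:
        raise ValueError("unknown distribution: %r" % distribution)
    return "%s/%s/hadoop-%s.tar.gz" % (channel, distribution, version)


def promote(key):
    """Map a package key under test/ to the same package under stable/."""
    channel, rest = key.split("/", 1)
    if channel != "test":
        raise ValueError("only test packages can be promoted")
    return "stable/" + rest


key = package_key("test", "apache", "2.7.1")
stable_key = promote(key)
```

Keeping the choice of distribution and channel in a key rather than in the worker image means a worker can switch versions by downloading a different package, without being rebuilt.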

Regression Test for Hive
• New features, version upgrades, and migrations must be introduced without regressions
• Run integration (system) tests and regression tests for Hive queries
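
A regression test of this kind can be as simple as running the same Hive query on the current and the candidate version and comparing the result sets. The sketch below shows only the comparison step. It assumes rows arrive as tuples and that row order is not significant; the function name and the sample rows are made up for illustration.

```python
from collections import Counter


def diff_results(expected_rows, actual_rows):
    """Return (missing, unexpected) rows between two query results.

    The comparison is a multiset comparison: duplicate rows are counted,
    but row order is ignored.
    """
    expected = Counter(expected_rows)
    actual = Counter(actual_rows)
    # Counter subtraction keeps only positive counts.
    missing = sorted((expected - actual).elements(), key=repr)
    unexpected = sorted((actual - expected).elements(), key=repr)
    return missing, unexpected


# Baseline from the current version, candidate from the version under test.
baseline = [("a", 1), ("b", 2), ("b", 2)]
candidate = [("b", 2), ("a", 1), ("c", 3)]
missing, unexpected = diff_results(baseline, candidate)
```

A query passes when both lists are empty. Queries with an explicit ORDER BY would need an ordered comparison instead.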

CDH, HDP, Apache
Worker
System Test Repository
Hadoop Repository
S3
Apache package
http://blog.circleci.com/meet-our-new-logo/

Queueing
• RDS-based queue management system

REST API for Hadoop
CDH, HDP, Apache
Worker
PerfectQueue
Hadoop Job Server
REST API
Presto
API

RDBMS-based Queue Management System

RDBMS-based queue management
CDH, HDP, Apache
Worker
Client, Client, Client
PerfectQueue
Hadoop Job Server

PerfectQueue
• Highly available distributed queue built on an RDBMS
• Amazon SQS-like API
• Resource scheduling for multi-tenancy
• Graceful and live restarting
https://github.com/treasure-data/perfectqueue
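
To illustrate what a queue built on an RDBMS looks like, here is a minimal sketch using Python's built-in `sqlite3`. It is not PerfectQueue's implementation or schema: the table layout, the function names, and the visibility-timeout approach (borrowed from the SQS model the slide mentions) are assumptions chosen to show the idea of persisting each request as a row that workers claim.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the queue database and create the task table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " visible_at REAL NOT NULL)"
    )
    db.commit()
    return db


def submit(db, payload, now):
    """Persist one request; it is visible to workers immediately."""
    with db:
        cur = db.execute(
            "INSERT INTO tasks (payload, visible_at) VALUES (?, ?)",
            (payload, now),
        )
    return cur.lastrowid


def acquire(db, now, timeout=60.0):
    """Claim the oldest visible task and hide it until now + timeout."""
    with db:
        row = db.execute(
            "SELECT id, payload FROM tasks"
            " WHERE visible_at <= ? ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # Conditional update: succeeds only if nobody claimed the row first.
        claimed = db.execute(
            "UPDATE tasks SET visible_at = ? WHERE id = ? AND visible_at <= ?",
            (now + timeout, row[0], now),
        )
        if claimed.rowcount != 1:
            return None
    return row


def finish(db, task_id):
    """Delete a completed task so it is never delivered again."""
    with db:
        db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))


# Timestamps are passed in explicitly to keep the example deterministic;
# a real worker would use time.time().
db = open_queue()
submit(db, "SELECT 1", now=0.0)
first = acquire(db, now=1.0)     # claimed, hidden until 61.0
second = acquire(db, now=2.0)    # nothing visible
retry = acquire(db, now=100.0)   # first worker never finished, task reappears
finish(db, retry[0])
empty = acquire(db, now=200.0)
```

Because a claimed task is hidden rather than removed, a worker that dies mid-job simply lets the timeout lapse and the request is picked up again, which is what makes restarts safe.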

What we should keep in mind
• Stateless
Delegate responsibility to cloud systems
• Mobility
Look ahead to version upgrades and migrations
• Queueing
Make each request persistent

Recap
• Maintainability of Distributed System
• Our Challenges
• Stateless Hive Metastore
• Multiple Hadoop version management
• Regression Test for Hive queries
• Workflow Integration
• What we should keep in mind

https://www.treasuredata.com/
