© 2016 MapR Technologies
Back to School – Hadoop Ecosystem Overview
Roadmap
• What is Apache?
• Hadoop Timeline and Level Set
• Hadoop Suite of tools
1. Hive
2. Sqoop
3. Pig
4. Oozie
5. HBase
6. Flume
7. Kafka
8. Drill
9. YARN
10. Zookeeper
• Use Cases
• Q&A
What is Apache?
• Non-profit organization (the Apache Software Foundation, or ASF)
• Governs the development of open source “Projects”
• “Top Level” projects are the most prominent
• Features “committers” from all over the world
Hadoop Timeline
• 2003 – Google publishes the GFS (Google File System) white paper
• 2004 – Google publishes the MapReduce white paper
• 2006 – Hadoop is born: HDFS + MapReduce
• 2009 – Hadoop distributions start popping up
• 2016 – Organized chaos: new projects released every few months, and only the winners gain traction
• 2007–Present – Hadoop continuously evolves; new tools are released to improve usability and make it easier to adopt
What is Hadoop?
Distributed File System + Processing Engine
HDFS + MapReduce
What is MapReduce?
• Three-phase programming model built for distributed processing
– Map
– Shuffle/Sort
– Reduce
• Processing overhead associated with MR jobs (~30 seconds)
• Heavy disk usage
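Not from the slides, but to make the three phases concrete: Hadoop Streaming lets any executable act as the mapper or reducer. A minimal sketch (the streaming jar path varies by distribution, and the HDFS input directory is illustrative):

    # Map: cat passes input lines through unchanged.
    # Shuffle/Sort: the framework sorts the mapped lines between phases.
    # Reduce: wc counts the lines, words, and bytes it receives.
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -input /data/input \
      -output /data/output \
      -mapper /bin/cat \
      -reducer /usr/bin/wc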
1.) Hive
• First SQL on Hadoop – HiveQL is the language
• Hadoop data warehousing tool
• Converts HiveQL into MapReduce jobs
• Bash, Java, and Python scripts can execute Hive commands
• Not ANSI-compliant, but VERY similar to standard SQL
Use Hive for long-running batch jobs -- not ad-hoc queries
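A minimal sketch of driving Hive from the command line (the table, columns, and HDFS path are illustrative):

    # -e runs an inline HiveQL string; -f runs a script file instead.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS sales (id INT, region STRING, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/sales';
      SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    "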
2.) Sqoop
• RDBMS connector for Hadoop
• Execute Sqoop scripts via the command line
• Sqoop can move schemas, tables, or SELECT statement results
• Helps improve ETL or enable data warehouse offload
Use Sqoop anytime data needs to move to/from an RDBMS
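A minimal import sketch (connection string, credentials, and table name are illustrative; -P prompts for the database password):

    sqoop import \
      --connect jdbc:mysql://db-host:3306/salesdb \
      --username etl_user -P \
      --table sales \
      --target-dir /data/sales \
      --num-mappers 4
    # sqoop export reverses the direction, pushing HDFS data back into the RDBMS.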
3.) Pig
• High-level language for processing data
• Language used to express data flows is called Pig Latin
• Pig turns data flows into a series of MR jobs
• Can run in a single JVM or on a Hadoop Cluster
• User Defined Functions (UDFs) make Pig code easy to repurpose
Use Pig to speed up the development process
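A minimal word-count sketch in Pig Latin (the input path is illustrative; -x local runs in a single JVM, and omitting it runs on the cluster):

    cat > wordcount.pig <<'EOF'
    -- Load raw lines, split them into words, then count each word.
    lines   = LOAD '/data/input.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '/data/wordcount_out';
    EOF
    pig -x local wordcount.pig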
4.) Oozie
• Workflow Orchestration
• Schedule tasks to be completed based on time or completion of a
previous task
• Used for Automation
• Develop these workflows either in a GUI or in XML
– Hint: the GUI is much much MUCH simpler
Use Oozie when you need workflows
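A minimal sketch of submitting a workflow from the command line (host, port, and job ID are illustrative; the workflow.xml must already be uploaded to HDFS and referenced from job.properties):

    # Submit and start the workflow described by job.properties.
    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
    # Check the status of a running workflow by its job ID.
    oozie job -oozie http://oozie-host:11000/oozie -info 0000001-160901-oozie-W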
5.) HBase
• Database built on HDFS
• Meant for big and fast data
• HBase is a NoSQL database
– Multiple types of NoSQL databases:
• Wide-column stores, document DBs, graph DBs, key-value stores
• HBase is a wide-column store
Use HBase when “real-time read/write access to very large
datasets” is required
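A minimal sketch using the HBase shell (table, column family, row key, and values are illustrative):

    hbase shell <<'EOF'
    create 'player_stats', 'stats'                       # table + column family
    put 'player_stats', 'player42', 'stats:wins', '17'   # write one cell
    get 'player_stats', 'player42'                       # random read by row key
    EOF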
6.) Flume
• Meant for ingesting streams of data
• Runs on the Hadoop cluster itself and stores data in HDFS
– Also flexible enough to stream into HBase or Solr
• Flume PUSHES data to its destination
• Flume does NOT store data within itself
Use Flume when basic streaming is required
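A minimal single-agent config sketch (the agent, source, channel, and sink names plus all paths are illustrative): an exec source tails a log file, a memory channel buffers events, and an HDFS sink pushes them out.

    # agent.conf
    a1.sources  = src1
    a1.channels = ch1
    a1.sinks    = sink1
    a1.sources.src1.type = exec
    a1.sources.src1.command = tail -F /var/log/app.log
    a1.sources.src1.channels = ch1
    a1.channels.ch1.type = memory
    a1.sinks.sink1.type = hdfs
    a1.sinks.sink1.hdfs.path = /flume/events
    a1.sinks.sink1.channel = ch1

Start the agent with:

    flume-ng agent --conf conf --conf-file agent.conf --name a1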
7.) Kafka
• …Also meant for ingesting streams of data
• Runs on its own cluster
• Kafka does not PUSH data to other places
– Other places pull from Kafka
• Kafka streams in the data, then PUBLISHES the data on its
cluster and multiple users can SUBSCRIBE to that data and get
their copy.
Use Kafka for advanced streaming
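A minimal publish/subscribe sketch with the console tools that ship with Kafka (broker address and topic name are illustrative):

    # Publish one event to a topic...
    echo '{"player":"player42","wins":17}' | \
      kafka-console-producer.sh --broker-list kafka-host:9092 --topic game-stats
    # ...and any number of consumers can independently pull their own copy.
    kafka-console-consumer.sh --bootstrap-server kafka-host:9092 \
      --topic game-stats --from-beginning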
8.) Drill
• Flexible SQL tool
• Works with many data types and storage platforms
• Does not require transformations to the data
• For ad-hoc analytics and performant queries on LARGE data sets
• Scales to thousands of nodes
Use Drill for data exploration and performant SQL
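A minimal sketch querying a raw JSON file in place, with no schema definition or ETL step (the file path and fields are illustrative; zk=local starts Drill in embedded mode):

    sqlline -u jdbc:drill:zk=local <<'EOF'
    SELECT t.region, COUNT(*) AS orders
    FROM dfs.`/data/orders.json` AS t
    GROUP BY t.region;
    EOF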
9.) YARN
• Yet Another Resource Negotiator
• Helps you allocate resources (and enforce usage quotas) to
multiple groups/users
10.) ZooKeeper
• Coordinates distributed processes and services
• Handles partial failures
• Provides synchronization
Use YARN for multitenancy
ALWAYS use ZooKeeper with Hadoop
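A minimal multitenancy sketch from YARN's capacity-scheduler.xml (queue names and percentages are illustrative): two queues split cluster resources 60/40.

    <!-- Every application is submitted to one of these queues. -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>etl,analytics</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.etl.capacity</name>
      <value>60</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.analytics.capacity</name>
      <value>40</value>
    </property>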
Use Case 1: Expensive RDBMS
• Organization has 5 TB of sales
data in RDBMS ($$$)
• Currently, 50 reports are generated regularly
• Largest report takes ~24 hours to
generate
• Team only knows SQL
Solution (from the diagram): Sqoop moves the data into HDFS; Hive generates the long-running reports; Hive/Drill serve the SQL queries.
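A hedged end-to-end sketch of the offload, combining the Sqoop and Hive commands shown earlier (all names and paths are illustrative):

    # 1. Land the sales table in HDFS (prompts for the DB password).
    sqoop import --connect jdbc:mysql://db-host:3306/salesdb \
      --username etl_user -P --table sales --target-dir /data/sales
    # 2. Point a Hive table at the landed files (define once).
    hive -e "CREATE EXTERNAL TABLE sales (id INT, region STRING, amount DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/sales';"
    # 3. Run each long-running report as a Hive script.
    hive -f largest_report.hql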
Use Case 2: Customer 360 Data Lake/Hub
• 50 TB of customer data
• Data consists of everything from ERP data to JSON data from a REST API
• Four different business units need access
to the data and they each have
performance requirements
• Basic users need ad-hoc query capabilities
• Weekly jobs need to be kicked off during off
hours
Solution (from the diagram): HDFS stores the data; YARN enforces per-tenant resource limits; Drill provides ad-hoc and performant SQL; Oozie schedules the weekly off-hours jobs.
Use Case 3: Online Video Game Support
• Stats need to be updated milliseconds
after the game finishes
• Players need to be able to randomly look up other players' stats in less than a second
• System can never go down or lose
information
• Management wants to save this data so
analytics can be run on these datasets.
Solution (from the diagram): Kafka/Flume stream game events into HBase; HBase serves the sub-second random reads; HDFS archives the data for analytics.
Advice for those getting started…
• Don’t try to hire a big data team; build from within
– MOTIVATED Linux and SQL people are enough to get started
• Target legacy RDBMS workloads and move ~80% to Hadoop
– Quick win
– Instant validation and justification if you can cut costs and improve speed
at the same time
• Have fun :)
Additional Resources
• Full List of Hadoop Ecosystem
• Books:
– Hadoop: The Definitive Guide
– Hadoop Application Architectures
• Free Training:
– Coursera and edX
• My favorite is a Python specialization series
– learn.mapr.com
• Free courses from 100 level to 400 level
Q&A
@mapr maprtech
matthewmiller@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
Back to School - St. Louis Hadoop Meetup September 2016