© 2016 MapR Technologies
Back to School – Hadoop Ecosystem Overview
Roadmap
• What is Apache?
• Hadoop Timeline and Level Set
• Hadoop Suite of tools
1. Hive
2. Sqoop
3. Pig
4. Oozie
5. HBase
6. Flume
7. Kafka
8. Drill
9. YARN
10. Zookeeper
• Use Cases
• Q&A
What is Apache?
• Non-profit organization (the Apache Software Foundation, or ASF)
• Governs the development of open source “Projects”
• “Top Level” projects are the most prominent
• Features “committers” from all over the world
Hadoop Timeline
• 2003 – Google publishes the GFS (Google File System) white paper
• 2004 – Google publishes the MapReduce white paper
• 2006 – Hadoop is born: HDFS + MapReduce
• 2009 – Hadoop distributions start popping up
• 2016 – Organized chaos: new projects released every few months, and only the winners gain traction
• 2007–Present – Hadoop continuously evolves; new tools are released to improve usability and make it easier to adopt
What is Hadoop?
Distributed File System + Processing Engine
HDFS + MapReduce
What is MapReduce?
• Three-phase programming model built for distributed processing
– Map
– Shuffle/Sort
– Reduce
• Processing overhead associated with MR jobs (~30 seconds)
• Heavy disk usage
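Not from the slides, but to make the three phases concrete: Hadoop Streaming lets any executable act as the mapper or reducer. A minimal sketch (the streaming jar path varies by distribution, and the HDFS input directory is illustrative):

    # Map: cat passes input lines through unchanged.
    # Shuffle/Sort: the framework sorts the mapped lines between phases.
    # Reduce: wc counts the lines, words, and bytes it receives.
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -input /data/input \
      -output /data/output \
      -mapper /bin/cat \
      -reducer /usr/bin/wc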
1.) Hive
• First SQL on Hadoop – HiveQL is the language
• Hadoop data warehousing tool
• Converts HiveQL into MapReduce jobs
• Bash, Java, and Python scripts can execute Hive commands
• Not ANSI-compliant, but VERY similar to standard SQL
Use Hive for long-running batch jobs -- not ad-hoc queries
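A minimal sketch of driving Hive from the command line (the table, columns, and HDFS path are illustrative):

    # -e runs an inline HiveQL string; -f runs a script file instead.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS sales (id INT, region STRING, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/sales';
      SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    "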
2.) Sqoop
• RDBMS connector for Hadoop
• Execute Sqoop scripts via the command line
• Sqoop can move schemas, tables, or SELECT statement results
• Helps improve ETL or enable data warehouse offload
Use Sqoop anytime data needs to move to/from an RDBMS
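A minimal import sketch (connection string, credentials, and table name are illustrative; -P prompts for the database password):

    sqoop import \
      --connect jdbc:mysql://db-host:3306/salesdb \
      --username etl_user -P \
      --table sales \
      --target-dir /data/sales \
      --num-mappers 4
    # sqoop export reverses the direction, pushing HDFS data back into the RDBMS.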
3.) Pig
• High-level language for processing data
• Language used to express data flows is called Pig Latin
• Pig turns data flows into a series of MR jobs
• Can run in a single JVM or on a Hadoop Cluster
• User Defined Functions (UDFs) make Pig code easy to repurpose
Use Pig to speed up the development process
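A minimal word-count sketch in Pig Latin (the input path is illustrative; -x local runs in a single JVM, and omitting it runs on the cluster):

    cat > wordcount.pig <<'EOF'
    -- Load raw lines, split them into words, then count each word.
    lines   = LOAD '/data/input.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '/data/wordcount_out';
    EOF
    pig -x local wordcount.pig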
4.) Oozie
• Workflow Orchestration
• Schedule tasks to be completed based on time or completion of a
previous task
• Used for Automation
• Develop these workflows either in a GUI or in XML
– Hint: the GUI is much much MUCH simpler
Use Oozie when you need workflows
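A minimal sketch of submitting a workflow from the command line (host, port, and job ID are illustrative; the workflow.xml must already be uploaded to HDFS and referenced from job.properties):

    # Submit and start the workflow described by job.properties.
    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
    # Check the status of a running workflow by its job ID.
    oozie job -oozie http://oozie-host:11000/oozie -info 0000001-160901-oozie-W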
5.) HBase
• Database built on HDFS
• Meant for big and fast data
• HBase is a NoSQL database
– Multiple types of NoSQL databases:
• Wide-column stores, document DBs, graph DBs, key-value stores
• HBase is a wide-column store
Use HBase when “real-time read/write access to very large
datasets” is required
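A minimal sketch using the HBase shell (table, column family, row key, and values are illustrative):

    hbase shell <<'EOF'
    create 'player_stats', 'stats'                       # table + column family
    put 'player_stats', 'player42', 'stats:wins', '17'   # write one cell
    get 'player_stats', 'player42'                       # random read by row key
    EOF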
6.) Flume
• Meant for ingesting streams of data
• Runs on the Hadoop cluster itself and stores data in HDFS
– Also flexible enough to stream into HBase or Solr
• Flume PUSHES data to its destination
• Flume does NOT store data within itself
Use Flume when basic streaming is required
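A minimal single-agent config sketch (the agent, source, channel, and sink names plus all paths are illustrative): an exec source tails a log file, a memory channel buffers events, and an HDFS sink pushes them out.

    # agent.conf
    a1.sources  = src1
    a1.channels = ch1
    a1.sinks    = sink1
    a1.sources.src1.type = exec
    a1.sources.src1.command = tail -F /var/log/app.log
    a1.sources.src1.channels = ch1
    a1.channels.ch1.type = memory
    a1.sinks.sink1.type = hdfs
    a1.sinks.sink1.hdfs.path = /flume/events
    a1.sinks.sink1.channel = ch1

Start the agent with:

    flume-ng agent --conf conf --conf-file agent.conf --name a1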
7.) Kafka
• …Also meant for ingesting streams of data
• Runs on its own cluster
• Kafka does not PUSH data to other places
– Other places pull from Kafka
• Kafka streams in the data, then PUBLISHES the data on its
cluster and multiple users can SUBSCRIBE to that data and get
their copy.
Use Kafka for advanced streaming
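A minimal publish/subscribe sketch with the console tools that ship with Kafka (broker address and topic name are illustrative):

    # Publish one event to a topic...
    echo '{"player":"player42","wins":17}' | \
      kafka-console-producer.sh --broker-list kafka-host:9092 --topic game-stats
    # ...and any number of consumers can independently pull their own copy.
    kafka-console-consumer.sh --bootstrap-server kafka-host:9092 \
      --topic game-stats --from-beginning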
8.) Drill
• Flexible SQL tool
• Works with many data types and storage platforms
• Does not require transformations to the data
• For ad-hoc analytics and performant queries on LARGE data sets
• Scales to thousands of nodes
Use Drill for data exploration and performant SQL
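A minimal sketch querying a raw JSON file in place, with no schema definition or ETL step (the file path and fields are illustrative; zk=local starts Drill in embedded mode):

    sqlline -u jdbc:drill:zk=local <<'EOF'
    SELECT t.region, COUNT(*) AS orders
    FROM dfs.`/data/orders.json` AS t
    GROUP BY t.region;
    EOF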
9.) YARN
• Yet Another Resource Negotiator
• Helps you allocate resources (and enforce usage quotas) to
multiple groups/users
10.) ZooKeeper
• Coordinates distributed processes and services
• Handles partial failures
• Provides synchronization
Use YARN for multitenancy
ALWAYS use ZooKeeper with Hadoop
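A minimal multitenancy sketch from YARN's capacity-scheduler.xml (queue names and percentages are illustrative): two queues split cluster resources 60/40.

    <!-- Every application is submitted to one of these queues. -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>etl,analytics</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.etl.capacity</name>
      <value>60</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.analytics.capacity</name>
      <value>40</value>
    </property>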
Use Case 1: Expensive RDBMS
• Organization has 5 TB of sales
data in RDBMS ($$$)
• Currently, 50 reports are generated regularly
• Largest report takes ~24 hours to
generate
• Team only knows SQL
Solution (from the diagram): Sqoop moves the data into HDFS; Hive generates the long-running reports; Hive/Drill serve the SQL queries.
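A hedged end-to-end sketch of the offload, combining the Sqoop and Hive commands shown earlier (all names and paths are illustrative):

    # 1. Land the sales table in HDFS (prompts for the DB password).
    sqoop import --connect jdbc:mysql://db-host:3306/salesdb \
      --username etl_user -P --table sales --target-dir /data/sales
    # 2. Point a Hive table at the landed files (define once).
    hive -e "CREATE EXTERNAL TABLE sales (id INT, region STRING, amount DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/sales';"
    # 3. Run each long-running report as a Hive script.
    hive -f largest_report.hql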
Use Case 2: Customer 360 Data Lake/Hub
• 50 TB of customer data
• Data consists of everything from ERP data to JSON data from a REST API
• Four different business units need access
to the data and they each have
performance requirements
• Basic users need ad-hoc query capabilities
• Weekly jobs need to be kicked off during off
hours
Solution (from the diagram): HDFS stores the data; YARN enforces per-tenant resource limits; Drill provides ad-hoc and performant SQL; Oozie schedules the weekly off-hours jobs.
Use Case 3: Online Video Game Support
• Stats need to be updated milliseconds
after the game finishes
• Players need to be able to randomly look up other players' stats in less than a second
• System can never go down or lose
information
• Management wants to save this data so
analytics can be run on these datasets.
Solution (from the diagram): Kafka/Flume stream game events into HBase; HBase serves the sub-second random reads; HDFS archives the data for analytics.
Advice for those getting started…
• Don’t try to hire a big data team; build from within
– MOTIVATED Linux and SQL people are enough to get started
• Target legacy RDBMS workloads and move ~80% to Hadoop
– Quick win
– Instant validation and justification if you can cut costs and improve speed
at the same time
• Have fun :)
Additional Resources
• Full List of Hadoop Ecosystem
• Books:
– Hadoop: The Definitive Guide
– Hadoop Application Architectures
• Free Training:
– Coursera and edX
• My favorite is a Python specialization series
– learn.mapr.com
• Free courses from 100 level to 400 level
Q&A
@mapr maprtech
matthewmiller@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
Back to School - St. Louis Hadoop Meetup September 2016