Spark and Hadoop
Perfect Together
Arun Murthy
Hortonworks Co-Founder
@acmurthy
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Operating System
Enable all data and applications to be accessible and shared by any end user.
Data Operating System
Hadoop/YARN-powered data operating system: a 100% open source, multi-tenant data platform for any application, any data set, anywhere.
Built on a centralized architecture of shared enterprise services:
• Scalable tiered storage
• Resource and workload management
• Trusted data governance & metadata management
• Consistent operations
• Comprehensive security
• Developer APIs and tools
[Diagram: YARN as the data operating system. Layers: governance, security, operations, resource management; data access (batch, interactive, real-time); storage. Deployment: commodity hardware, appliance, cloud.]
Why We Love Spark at Hortonworks
Elegant Developer APIs
– DataFrames, Machine Learning, and SQL
Made for Data Science
– All apps need to get predictive at scale and fine granularity
Democratize Machine Learning
– Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop
Community
– Broad developer, customer and partner interest
Realize Value of Data Operating System
– A key tool in the Hadoop toolbox
[Diagram: Scala, Java, Python and R APIs on the Spark Core Engine with Spark SQL, Spark Streaming, MLlib and GraphX, running on YARN over HDFS nodes 1 to N.]
Let’s Talk Real-World Use-Cases
USE CASE: Streaming & Machine Learning for Web Analytics
CHALLENGE
• Cost: Storage and processing not economical at scale
• Silos: Separate clusters for Hadoop and Spark
• Analytics: Retroactive view, limited predictive capabilities
SOLUTION
• Spark Streaming for ingesting events in real time
• SparkML for predictive analytics
IMPACT
• Cost: Hardware costs reduced 25-50%
• YARN: Shared cluster for Spark, MapReduce, Hive, etc.
• Analytics: Processes 10 billion events daily at 20 milliseconds per event
Platform: HDP
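To put the impact figures in perspective, the short calculation below converts them into a sustained rate. It is a back-of-the-envelope sketch only: it assumes events arrive uniformly over a 24-hour day and reads the 20 milliseconds as average per-event processing time, neither of which the slide states. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_second(events_per_day: float) -> float:
    """Average arrival rate, assuming a uniform load over the day."""
    return events_per_day / SECONDS_PER_DAY


def events_in_flight(rate_per_second: float, latency_seconds: float) -> float:
    """Little's law: average number in the system = arrival rate * time in system."""
    return rate_per_second * latency_seconds


rate = events_per_second(10e9)              # 10 billion events per day
in_flight = events_in_flight(rate, 0.020)   # 20 ms per event

print(f"{rate:,.0f} events/second on average")
print(f"{in_flight:,.0f} events being processed at any instant")
```

Under these assumptions the cluster sustains roughly 116,000 events per second, with about 2,300 events in flight at any moment. Real traffic is bursty, so peak figures would be higher.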
USE CASE: INSURANCE
Claims Reimbursement Processing with Spark
CHALLENGE
• Overwhelmed by data ingest rates
• Team expertise in R
• Many key features, such as textual features, not incorporated
SOLUTION
• Use Spark to optimize the claims reimbursement process
• Leverage Spark’s machine learning capabilities to process and analyze all claims
IMPACT
• Insurance companies can now process all claims and detect overpayment and underpayment
• More accurate overpayment and underpayment detection
Platform: HDP
On-going Customer Use Cases with Spark
HealthCare: Entity Resolution
• Different feeds represent entities in different forms
– need to identify same entities
• No standard packages available to do the job.
• Additional features need to be derived to expand context and disambiguate
Requirement
• Extensible entity resolution framework
– mix supervised and unsupervised learning
[Diagram: records "John Smith" and "J. Smith" from two feeds, compared on name and location to decide whether they are the same entity.]
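The matching idea on this slide can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the customer's framework: `Record`, `name_key`, `match_features` and `same_entity` are hypothetical names, the sample records are invented, and a hand-written rule stands in for the supervised and unsupervised models the slide calls for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    location: str


def name_key(name: str) -> tuple:
    """Reduce a name to (first initial, surname) so 'John Smith' and 'J. Smith' align."""
    parts = name.replace(".", " ").lower().split()
    return (parts[0][0], parts[-1])


def match_features(a: Record, b: Record) -> dict:
    """Derive pairwise features; a real framework would add many more."""
    return {
        "name": name_key(a.name) == name_key(b.name),
        "location": a.location.strip().lower() == b.location.strip().lower(),
    }


def same_entity(a: Record, b: Record) -> bool:
    """Hand-written rule standing in for a trained classifier: all features must agree."""
    return all(match_features(a, b).values())


feed_a = Record("John Smith", "Boston")
feed_b = Record("J. Smith", "boston")
feed_c = Record("Jane Smith", "Chicago")

print(same_entity(feed_a, feed_b))
print(same_entity(feed_a, feed_c))
```

"Jane Smith" collapses to the same name key as "John Smith", so the name feature alone would merge them. The location feature is what separates the two, which is the kind of added context the slide refers to.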
On-going Customer Use Cases with Spark
Finance: Efficiently bring HBase Data into Spark
• HBase is the operational store of record
• End user Analytics via Spark
Requirement:
• Efficient scans via predicate pushdown via co-processors
[Diagram: YARN over HDFS nodes.]
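Why predicate pushdown (together with column pruning, mentioned in the speaker notes) makes scans efficient can be shown with a toy model. The sketch below is plain Python and does not use the HBase or Spark APIs: `ToyStore` is a hypothetical in-memory stand-in for the store, and the rows are invented.

```python
class ToyStore:
    """In-memory stand-in for a server-side store; NOT the HBase API."""

    def __init__(self, rows):
        self.rows = rows
        self.cells_sent = 0  # counts values shipped to the client

    def scan(self, predicate=None, columns=None):
        """Filter rows and prune columns on the 'server' before returning them."""
        for row in self.rows:
            if predicate is not None and not predicate(row):
                continue
            out = row if columns is None else {c: row[c] for c in columns}
            self.cells_sent += len(out)
            yield out


rows = [{"account": i, "region": "EU" if i % 4 == 0 else "US", "balance": i * 10}
        for i in range(1000)]

# Without pushdown: ship everything, then filter and project on the client.
naive = ToyStore(rows)
client_side = [{"balance": r["balance"]} for r in naive.scan() if r["region"] == "EU"]

# With pushdown: the store applies the filter and returns only the needed column.
pushed = ToyStore(rows)
server_side = list(pushed.scan(predicate=lambda r: r["region"] == "EU",
                               columns=["balance"]))

print(naive.cells_sent, pushed.cells_sent)
```

Both paths produce the same result, but the pushed-down scan transfers only the matching rows and the one requested column.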
Spark + Hadoop - The Road Ahead
Innovate at the Core
Seamless Data Access
Data Science Acceleration
Spark + Hadoop - The Road Ahead
Innovate at the Core
Storage
– RDD Sharing with HDFS Memory Tier
Resource Management
– Spark on YARN improvements
Security
– Enhance SparkSQL Security
– Enhance Wire Encryption
Governance
– Integration with Atlas
Operations
– Ambari to Support Multiple Spark Versions
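The speaker notes tie the "Spark on YARN improvements" item to dynamic resource allocation. As a point of reference, the sketch below assembles a spark-submit command line that switches on Spark's dynamic executor allocation on YARN. The property names are Spark configuration keys; the executor counts, the application file `my_app.py` and the helper `spark_submit_command` are placeholder assumptions, and the right values depend on the cluster.

```python
def spark_submit_command(app, conf):
    """Build a spark-submit argument list for YARN from a dict of Spark properties."""
    cmd = ["spark-submit", "--master", "yarn"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Illustrative values only; tune the executor bounds for the actual workload.
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",   # external shuffle service on the NodeManagers
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
}

print(" ".join(spark_submit_command("my_app.py", dynamic_allocation)))
```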
[Diagram: the Spark stack (APIs, Spark Core Engine, Spark SQL, Spark Streaming, MLlib, GraphX) on YARN over HDFS nodes 1 to N, made enterprise ready through governance, security and operations; seamless data access to SQL (Hive) and NoSQL (HBase) stores and to ORC file, NiFi, HBase and custom data sources; Zeppelin notebook on top.]
Seamless Data Access
Data Access
– Hive ORC
– HBase Connector
– NiFi Streams
Data Science Acceleration
Apache Zeppelin
– Spark development and visualization
Data Science Libraries
– Jump-start tools leveraging Spark Machine Learning (e.g., Magellan, Entity Resolution)
Core Machine Learning
– One vs Rest classifiers, multi-class
classifiers
– Serializing Machine Learning pipelines
– Data set loader for public data sets
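For readers unfamiliar with the One vs Rest strategy listed above, here is a minimal plain-Python sketch of the reduction. It is not Spark's implementation: the binary scorer is a deliberately simple nearest-centroid rule, the training points are invented, and all function names are hypothetical.

```python
import math


def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def train_binary(positives, negatives):
    """Toy binary scorer: higher means closer to the positive centroid than the negative one."""
    pos, neg = centroid(positives), centroid(negatives)
    return lambda x: math.dist(x, neg) - math.dist(x, pos)


def train_one_vs_rest(data):
    """Train one binary scorer per class, treating every other class as negative."""
    models = {}
    for label, points in data.items():
        rest = [p for other, pts in data.items() if other != label for p in pts]
        models[label] = train_binary(points, rest)
    return models


def predict(models, x):
    """Pick the class whose binary scorer is most confident."""
    return max(models, key=lambda label: models[label](x))


training = {
    "a": [(0, 0), (1, 0), (0, 1)],
    "b": [(10, 0), (9, 0), (10, 1)],
    "c": [(0, 10), (1, 10), (0, 9)],
}
models = train_one_vs_rest(training)

print(predict(models, (9, 1)), predict(models, (0.5, 0.5)), predict(models, (1, 9)))
```

The same wrapper works with any binary classifier that returns a score, which is why the strategy is a convenient way to extend binary learners to multi-class problems.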
Introducing Apache Zeppelin: Open Web-based Notebook for interactive analytics
Features
Ad-hoc experimentation
Deeply integrated with Spark + Hadoop
Supports multiple language backends
Incubating at Apache
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Geospatial Insight
Where do people go on weekends?
Does usage pattern change with time?
Predict the drop off point of a user?
Predict the location where next pick up can be expected?
Identify crime hotspots
How do these hotspots evolve with time?
Predict the likelihood of crime occurring at a given neighborhood
Predict climate at fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: Join climate dataset with Crimes
Magellan on Spark Packages
What
– Brings GeoSpatial Analytics to Big Data powered by Spark
– Available at Spark Packages: http://spark-packages.org/package/harsha2010/magellan
Key Features
– Parse geospatial data and metadata into Shapes + Metadata
– Python and Scala support
– Efficient Geometric Queries
- ESRI Hive Library
- simple and intuitive syntax
– Scalable implementations of common algorithms
Learn More
– Magellan Blog: http://tinyurl.com/magellanBlog
– Sample Magellan Notebook: http://tinyurl.com/zeppelinNotebooks
Spark + Hadoop - The Road Ahead
Innovate at the Core
Seamless Data Access
Data Science Acceleration
Magellan Walkthrough
Maximizing Revenue for UBER drivers
Data Insight Opportunity
– Uber publishes anonymized GeoSpatial trip data
– City of San Francisco has an active Open Data program
- demographics, neighborhoods, traffic
Challenges
– In which neighborhood should a driver hang out to maximize revenue?
Solution
– Leverage Spark to transform and aggregate data
– Use Magellan to do GeoSpatial Queries
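The core of this analysis is a point-in-polygon join (which neighborhood contains each pickup?) followed by an aggregation. The sketch below shows that logic in plain Python. It uses neither Spark nor Magellan, whose APIs are not shown in the slides; the neighborhood polygons, trips and fares are invented, and `point_in_polygon` is a standard ray-casting test written for illustration.

```python
from collections import Counter


def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical neighborhoods as simple polygons in arbitrary planar coordinates.
neighborhoods = {
    "north": [(0, 5), (10, 5), (10, 10), (0, 10)],
    "south": [(0, 0), (10, 0), (10, 5), (0, 5)],
}

# Hypothetical trips: (pickup point, fare).
trips = [((2, 7), 12.0), ((8, 9), 30.0), ((5, 1), 9.5), ((3, 3), 14.0), ((20, 20), 50.0)]

# Join each pickup to the neighborhood that contains it, then sum fares per neighborhood.
revenue = Counter()
for pickup, fare in trips:
    for name, polygon in neighborhoods.items():
        if point_in_polygon(pickup, polygon):
            revenue[name] += fare
            break

best = max(revenue, key=revenue.get)
print(best, dict(revenue))
```

Testing every point against every polygon does not scale to city-sized data, which is where a distributed engine and a geospatial library with optimized joins come in.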
[Notebook screenshots: uber-magellan-nb]
Spark + Hadoop
Perfect Together
Hortonworks Data Platform

Spark Summit EMEA - Arun Murthy's Keynote


Editor's Notes

  • #5 Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence. Because HDP is available across Windows, Linux, on-premises and cloud deployment environments, it is that much easier for enterprises to adopt.
  • #7 Talk track: machine learning with Spark. Webtrends provides digital marketing analytics. Stats: processes 10 billion events per day at an average speed of 20 milliseconds per event. [Background] Listen to Peter Crossley's Hadoop Summit keynote (min 22:22): http://brightcove.fora.tv/services/player/bcpid4287593805001?bckey=AQ~~,AAACbMgRlRk~,KnD13XNmCDZZPWkNmxmPMFTH2h0USbHh&bclid=4287661488001&bctid=4285724101001 Read the recent article on Webtrends by Peter Crossley: http://insidebigdata.com/2015/07/14/strategic-big-data-pivot-leads-webtrends-to-success/
  • #8 Overwhelmed by data ingest rates; sample data reduced to fit into an edge node. Random Forest models in R. Team expertise in R, not Scala/Java. Many key features, such as textual features, not incorporated (R cannot handle the feature blowup).
  • #9 HBase has efficient scans; however, can Spark leverage them? Push predicates and prune columns.
  • #12 Allow for RDD sharing with the HDFS Memory Tier. Improve dynamic resource allocation via YARN. Mature SparkSQL and Spark Streaming to GA quality. HDFS Memory Tier: There are many use cases where Spark today feels less than ideal. For example, using Spark in a shared environment, with a middle tier fielding requests from multiple tiers, is a common problem. SparkContext is a heavyweight object and is tied to a specific user session. Using Spark in a shared environment requires features such as RDD sharing with the HDFS memory tier and SparkContext sharing. There are other areas for improvement in Spark’s YARN integration. For example, today YARN application logs are published only when the application finishes running. This model does not work for Spark Streaming, which is a long-running application and does not get a chance to publish its logs. Spark’s YARN ATS integration also needs work to help it scale and not become a bottleneck. SparkSQL is another critical area where we want to add more value by bringing Hive-level features such as security (SPARK-11265), ACID and vectorization to SparkSQL and making it GA in our platform over the coming months.
  • #13 Seamless Data Access. One Hive: seamless use of capabilities across Spark and Hive via SQL, including common file formats. Deliver connectors for HBase (HFile). Spark is a great data processing engine, and it provides more value when it can process more data. The value of a data lake is that it brings more data under one roof and opens new opportunities for insight and efficiency. The data lake is delivered by YARN and Hadoop, which provide massive scale and run all types of workloads to take advantage of that data. With the DataSource API, Spark provides a first-class way to bring in data from external sources while leveraging those systems for filtering, predicate pushdown, etc. We used the DataSource API to bring ORC data into Spark, and we are now working to bring HBase data efficiently into Spark for processing. This will differ from the existing Spark + HBase connector in that it will leverage HBase for predicate pushdown and column filtering and will be more efficient. Look for a tech preview of this feature in the coming months.
  • #14 Data science notebooks and automation for the most common analysis scenarios (Zeppelin). Include support for GeoSpatial and Entity Resolution (Magellan).
  • #15 Talk track. Ad-hoc experimentation: Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc. Deeply integrated with Spark + Hadoop: can be managed via Ambari Stacks. Supports multiple language backends: pluggable “Interpreters”. Incubating at Apache: 100% open source and open community.
  • #17 Hive: ESRI Hive (thin wrapper on ESRI Java). Magellan is available on GitHub (https://github.com/harsha2010/magellan). It can parse and understand the most widely used formats: GeoJSON, ESRI Shapefile. All geometries supported. 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan). Broadcast join available for common scenarios. Work in progress (targeted for 1.0.4): Geohash join optimization; map matching algorithm using Markov models. Python and Scala support. Please give it a try and give us feedback!
  • #20 One of our insurance customers is using Spark to optimize claims reimbursements and is using Spark’s machine learning capabilities to process and analyze all claims.
  • #25 We have seen rapid adoption of Spark in our customer base, and we want to thank our customers for choosing Spark on HDP. We also want to thank our partners Microsoft, Databricks, HP and NFLabs, and the community, for sharing this journey with us.