Robert Hryniewicz
Data Advocate
Twitter: @RobH8z
Email: rhryniewicz@hortonworks.com
Apache Spark Crash Course
Hadoop Summit Tokyo 2016
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Agenda
• Background
• Spark	Overview
• Zeppelin	Overview
• Components	of	HDP
• Lab	~	45min
3 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Data	Sources
→ Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo
44 ZB of data expected by 2020
4 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
The	“Big	Data”	Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over the network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
5 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Background
6 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
History	of	Hadoop &	Spark
7 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Access	Rates
At least an order of magnitude difference between memory and hard drive / network speeds
(memory: fast; hard drive: slower; network: slowest)
8 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What	Is	Apache	Spark?
→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms
9 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Why	Apache	Spark?
→ Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
→ In-memory computation model – Fast!
– Effective for iterative computations and ML
→ Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
10 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Ecosystem
Spark Core
Spark SQL | Spark Streaming | Spark MLlib | GraphX
11 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Apache	Spark	Basics
12 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Context
What is it?
→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code (in Zeppelin)
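As a quick illustration (a minimal sketch; in Zeppelin and the Spark shell the sc variable is already created for you), sc can distribute a local collection as an RDD and run a computation on it in parallel:

// sc is the pre-created SparkContext
val nums = sc.parallelize(1 to 100)            // distribute a local collection as an RDD
val sumOfSquares = nums.map(n => n * n).sum()  // computed in parallel across the cluster
println(s"Sum of squares 1..100 = $sumOfSquares")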
13 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	SQL
14 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	SQL	Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
– DataFrame API
– SQL queries
– Dataset API
15 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
[Diagram: a DataFrame laid out as rows and columns (Col1 … ColN)]
Data is described as a DataFrame with rows, columns, and a schema
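For instance (a minimal sketch, assuming a sqlContext is available as on the following slides; the names and values are made up), a small DataFrame can be built from a local collection and its schema inspected:

import sqlContext.implicits._
val peopleDF = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")  // columns get names and types
peopleDF.printSchema()  // shows the schema: name (string), age (int)
peopleDF.show()         // renders the rows and columns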
16 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
DataFrames
Created from Various Sources
[Diagram: DataFrames built via Spark SQL from Hive, JSON, CSV, Avro, and text sources]
→ DataFrames from Hive:
– Reading and writing Hive tables
→ DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-ins: CSV, HBase, Avro
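As an illustration (a sketch only — the table name and path are hypothetical, and it assumes a HiveContext hc as introduced on the next slide, which also provides the built-in ORC source in Spark 1.x):

val salesDF  = hc.table("default.sales")                   // DataFrame from an existing Hive table
val eventsDF = hc.read.format("orc").load("/data/events")  // DataFrame from ORC files on HDFS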
17 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
SQL	Context
SQLContext
→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext
val sqlContext = new SQLContext(sc)
HiveContext
→ Superset of the functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions → UDFs
→ Use when your data resides in Hive
val hc = new HiveContext(sc)
18 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	SQL	Examples
19 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Setting	up	DataFrame API
Create a DataFrame
val flightsDF = …   ← create from CSV, JSON, Hive, etc.
Example:
val path = "examples/flights.json"
val flightsDF = sqlContext.read.json(path)
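For CSV sources in Spark 1.x (a sketch, assuming the external spark-csv package is on the classpath and a hypothetical file path):

val flightsCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")  // external CSV data source for Spark 1.x
  .option("header", "true")            // first line holds the column names
  .option("inferSchema", "true")       // infer column types from the data
  .load("examples/flights.csv")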
20 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Setting	up	SQL	API
Register	a	Temporary	Table
flightsDF.registerTempTable("flights")
21 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Two	API	Examples:	DataFrame and	SQL	APIs
DataFrame API
flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
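To run the SQL version programmatically (a minimal sketch), pass the query string to sqlContext.sql, which returns a DataFrame:

val delayedDF = sqlContext.sql(
  "SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5")
delayedDF.show()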
22 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
23 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What	is	Stream	Processing?
Batch	Processing
• Ability	to	process	and	analyze	data	at-rest	(stored	data)
• Request-based,	bulk	evaluation	and	short-lived	processing
• Enabler	for	Retrospective,	Reactive	and	On-demand	Analytics
Stream	Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event- or micro-batch-driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
24 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Modern Data Applications approach to Insights
Traditional Analytics
• Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics
• Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…
25 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Overview
→ Extension of the Spark Core API
→ Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
[Diagram: streaming input sources (e.g. ZeroMQ, MQTT) feeding Spark Streaming]
26 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
27 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Discretized Streams (DStreams)
→ High-level abstraction representing a continuous stream of data
→ Internally represented as a sequence of RDDs
→ Operations applied on a DStream translate to operations on the underlying RDDs
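A minimal sketch of the idea (the socket source host/port is hypothetical; in Spark 1.x a StreamingContext is created from the existing SparkContext):

import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String] from a text socket
val counts = lines.flatMap(_.split(" "))              // each micro-batch is processed as an RDD
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()                                        // print a few counts per batch
ssc.start()                                           // start receiving and processing data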
28 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Example:	flatMap operation
29 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Window Operations
→ Apply transformations over a sliding window of data, e.g. a rolling average
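For example (a sketch building on the word-count DStream above; the window and slide durations are illustrative):

// count words over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()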
30 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	MLlib
31 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Where Can We Use Machine Learning (Data Science)
Healthcare
• Predict	diagnosis
• Prioritize	screenings
• Reduce	re-admittance	rates
Financial	services
• Fraud	Detection/prevention
• Predict	underwriting	risk
• New	account	risk	screens
Public	Sector
• Analyze	public	sentiment
• Optimize	resource	allocation
• Law	enforcement	&	security	
Retail
• Product	recommendation
• Inventory	management
• Price	optimization
Telco/mobile
• Predict	customer	churn
• Predict	equipment	failure
• Customer	behavior	analysis
Oil	&	Gas
• Predictive	maintenance
• Seismic	data	management
• Predict	well	production	levels
32 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
...
...
33 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Linear Regression Model Training (one feature)
Training result: coefficient 2.81, intercept 3.05
Fitted model: y = 2.81x + 3.05
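A sketch of how such a model is trained with the Spark ML (DataFrame-based) API, assuming the scatterData DataFrame with label/features columns shown earlier:

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(100)                  // solver iteration limit
val lrModel = lr.fit(scatterData)   // train on the DataFrame
println(s"Coefficients: ${lrModel.coefficients}  Intercept: ${lrModel.intercept}")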
34 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
35 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark API for building ML pipelines
[Diagram: a Pipeline chains feature transform 1, feature transform 2, a combine-features stage, and Linear Regression; fitting it on an input DataFrame (Train) produces a Pipeline Model, which transforms new input DataFrames (Predict) into an output DataFrame and can be exported]
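A minimal sketch of such a pipeline (the column names and input DataFrames are hypothetical; a VectorAssembler plays the combine-features role):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

val indexer   = new StringIndexer().setInputCol("carrier").setOutputCol("carrierIndex")  // feature transform
val assembler = new VectorAssembler()                                                    // combine features
  .setInputCols(Array("carrierIndex", "distance"))
  .setOutputCol("features")
val lr = new LinearRegression().setLabelCol("DepDelay").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(trainingDF)         // Train → Pipeline Model
val predictions = model.transform(testDF)    // Predict → output DataFrame with a prediction column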
36 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	GraphX
37 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
GraphX
→ PageRank
→ Topic Modeling (LDA)
→ Community Detection
Source:	ampcamp.berkeley.edu
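A sketch of running PageRank with GraphX (the edge-list file is hypothetical; each line holds a source and destination vertex id):

import org.apache.spark.graphx.GraphLoader
val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")  // build the graph from an edge list
val ranks = graph.pageRank(0.0001).vertices                     // run PageRank to a convergence tolerance
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)  // top 5 vertices by rank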
38 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Apache	Zeppelin	&	HDP	Sandbox
39 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What’s Apache Zeppelin?
A web-based notebook that enables interactive data analytics.
You can make beautiful, data-driven, interactive and collaborative documents with SQL, Scala and more.
40 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in the browser
• Zeppelin sends the code to a backend for execution
• Zeppelin gets data back from the backend
• Zeppelin visualizes the data
• A Zeppelin Note = a set of paragraphs (cells)
• Other features: sharing, collaboration, reports, import/export
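Each paragraph begins with an interpreter binding; for example (a sketch that reuses the flights DataFrame and temp table from the Spark SQL slides):

%spark
// Scala paragraph, executed by the Spark interpreter
flightsDF.groupBy("Origin").count().show(5)

%sql
-- SQL paragraph; results render as Zeppelin tables and charts
SELECT Origin, count(*) AS flights FROM flights GROUP BY Origin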
41 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Big	Data	Lifecycle
Collect → ETL/Process → Analysis → Report → Data Product
Roles along the lifecycle: Data Engineer, Data Scientist, Business User, Customer
All in one place in Zeppelin!
42 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
How	does	Zeppelin	work?
[Diagram: a Notebook Author and Collaborators/Report viewers work through Zeppelin, which runs code on the cluster against Spark | Hive | HBase — any of 30+ back ends]
43 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
HDP	Sandbox
What’s	included	in	the	HDP	Sandbox?
→ Zeppelin
→ Spark
→ YARN → Resource Management
→ HDFS → Distributed Storage Layer
→ And many more components: Hive, Solr, etc.
[Diagram: Scala/Java/Python/R APIs over the Spark Core Engine (Spark SQL, Spark Streaming, MLlib, GraphX), running on YARN over HDFS across cluster nodes 1…N]
44 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Access patterns enabled by YARN
YARN: Data Operating System
HDFS: Hadoop Distributed File System (cluster nodes 1…N)
• Batch – needs to happen, but with no timeframe limitations
• Interactive – needs to happen at human time
• Real-Time – needs to happen at machine execution time
45 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Why	Apache	Spark	on	YARN?
→ Resource management
– Share Spark workloads with other workloads (Hive, Solr, etc.)
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues
[Diagram: the client-side Spark Driver talks to a Spark Application Master running in a YARN container, which manages multiple Spark Executors (each in its own YARN container) running tasks]
46 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing data locality
• Not just storage, but computation
[Diagram: a logical file is split into blocks 0–4; each block is replicated 3 times and spread across the nodes of the cluster]
47 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
There’s more to HDP
Hortonworks Data Platform 2.4.x
Data Management: HDFS (Hadoop Distributed File System) and YARN (Data Operating System) across cluster nodes 1…N
Data Access: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory, other ISV engines — on Tez and Slider
Governance & Integration: Data Lifecycle & Governance (Falcon, Atlas); Data Workflow (Sqoop, Flume, Kafka, NFS, WebHDFS)
Security: Administration, Authentication, Authorization, Auditing, Data Protection (Ranger, Knox, Atlas, HDFS Encryption)
Operations: Provisioning, Managing & Monitoring (Ambari, Cloudbreak, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premise, Cloud
48 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Hortonworks	Data	Cloud
49 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
50 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
51 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Bringing	Multitenancy	to	Apache	Zeppelin
52 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Introducing	Livy
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as a Spark Ambari service
[Diagram: a Livy client calls the Livy Server over HTTP; the Livy Server drives Spark interactive sessions and Spark batch sessions (each with its own SparkContext) over HTTP/RPC]
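As an illustration of the REST flow (a simplified sketch): POST /sessions with a body like {"kind": "spark"} creates an interactive session, POST /sessions/{id}/statements with {"code": "sc.version"} runs a snippet in it, and GET /sessions/{id}/statements polls for the results.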
53 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Security	Across	Zeppelin-Livy-Spark
[Diagram: Zeppelin (Shiro authentication backed by LDAP, Livy/Spark group interpreter) calls the Livy Server APIs via SPNEGO/Kerberos; the Livy Server launches the Spark driver on YARN using Kerberos]
54 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Reasons	to	Integrate	with	Livy
→ Bring sessions to Apache Zeppelin
– Isolation
– Session sharing
→ Enable efficient cluster resource utilization
– The default Spark interpreter keeps the YARN/Spark job running forever
– The Livy interpreter is recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity propagation
– Send user identity from Zeppelin → Livy → Spark on YARN
55 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
SparkContext Sharing
[Diagram: on the Livy Server, Clients 1 and 2 attach to Session-1 and Client 3 to Session-2; each session (SparkSession-1, SparkSession-2) is backed by its own SparkContext]
56 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Sample	Architecture
57 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Managed	Dataflow
[Diagram: data flows from SOURCES through REGIONAL INFRASTRUCTURE to CORE INFRASTRUCTURE]
58 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
High-Level	Overview
[Diagram: IoT devices send data to single-node IoT edge nodes, which forward it to a NiFi hub and data broker; a data center (on-prem/cloud) persists it to a data store (HDFS/S3) and a column DB (HBase/Cassandra) that feed a live dashboard]
59 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What’s	new	in	Spark	2.0
60 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	2.0
→ API Improvements
– SparkSession (spark) – new entry point (replaces SQLContext and HiveContext)
– Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
– Structured Streaming / Continuous Applications (concept of an infinite DataFrame)
– Temporary Table → Temporary View
→ Performance Improvements
– Tungsten Phase 2 – whole-stage code generation
– ORC & Parquet file improvements
→ Machine Learning
– ML pipelines are the new API; the RDD-based MLlib API is deprecated
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
→ Spark SQL
– More SQL support (new ANSI SQL parser, subquery support)
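For example (a sketch of the new Spark 2.0 entry point; in a Spark 2.0 shell or Zeppelin the spark variable is pre-created):

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Flights")
  .enableHiveSupport()        // optional: the Hive access a HiveContext used to provide
  .getOrCreate()

val flightsDF = spark.read.json("examples/flights.json")
flightsDF.createOrReplaceTempView("flights")               // replaces registerTempTable
spark.sql("SELECT count(*) FROM flights").show()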
61 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What’s	the	latest	at	Hortonworks?
→ HDP 2.5 – batch processing (data at rest)
→ HDF 2.0 – streaming apps (data in motion)
Modern Data Applications: data at rest + data in motion → actionable intelligence
62 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Lab	Preview
63 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Lab	Setup	Instructions
http://tinyurl.com/hwx-spark-intro
Lab	Options
- Local Sandbox (8 GB RAM required):
  - VirtualBox or VMware
- Amazon AWS Cloud:
  - Hortonworks Data Cloud
    → Setup info: http://hortonworks.github.io/hdp-aws/index.html
64 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Hortonworks	Community	Connection
65 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Community	Engagement
Participate now at: community.hortonworks.com
9,500+
Registered	Users
21,000+
Answers
32,500+
Technical	Assets
One Website!
66 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Hortonworks	Community	Connection
Read access for everyone, join to participate and be recognized
• Full	Q&A	Platform	(like	StackOverflow)
• Knowledge	Base	Articles
• Code	Samples	and	Repositories
Robert	Hryniewicz
E:	rhryniewicz@hortonworks.com
T:	@RobH8z
Thanks!
#HSTokyo16 Apache Spark Crash Course