Robert Hryniewicz
Data Advocate
Twitter: @RobH8z
Email: rhryniewicz@hortonworks.com
Apache Spark Crash Course
Hadoop Summit Tokyo 2016
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Agenda
• Background
• Spark	Overview
• Zeppelin	Overview
• Components	of	HDP
• Lab	~	45min
3 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Data	Sources
→ Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo
44 ZB of data expected by 2020
4 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
The	“Big	Data”	Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over the network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
5 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Background
6 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
History	of	Hadoop &	Spark
7 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Access	Rates
At least an order of magnitude difference between memory and hard drive / network speeds
(memory: fast; hard drive: slower; network: slowest)
8 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What	Is	Apache	Spark?
→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms
9 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Why	Apache	Spark?
→ Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
→ In-memory computation model – Fast!
– Effective for iterative computations and ML
→ Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
10 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Ecosystem
Spark Core
Spark SQL | Spark Streaming | Spark MLlib | GraphX
11 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Apache	Spark	Basics
12 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Context
What is it?
→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code (in Zeppelin)
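As a quick illustration (a minimal sketch; in Zeppelin and the Spark shell the sc variable is already created for you), sc can distribute a local collection as an RDD and run a computation on it in parallel:

// sc is the pre-created SparkContext
val nums = sc.parallelize(1 to 100)            // distribute a local collection as an RDD
val sumOfSquares = nums.map(n => n * n).sum()  // computed in parallel across the cluster
println(s"Sum of squares 1..100 = $sumOfSquares")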
13 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	SQL
14 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	SQL	Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
– DataFrame API
– SQL queries
– Dataset API
15 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
[Diagram: a DataFrame laid out as rows and columns (Col1 … ColN)]
Data is described as a DataFrame with rows, columns, and a schema
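For instance (a minimal sketch, assuming a sqlContext is available as on the following slides; the names and values are made up), a small DataFrame can be built from a local collection and its schema inspected:

import sqlContext.implicits._
val peopleDF = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")  // columns get names and types
peopleDF.printSchema()  // shows the schema: name (string), age (int)
peopleDF.show()         // renders the rows and columns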
16 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
DataFrames
Created from Various Sources
[Diagram: DataFrames built via Spark SQL from Hive, JSON, CSV, Avro, and text sources]
→ DataFrames from Hive:
– Reading and writing Hive tables
→ DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-ins: CSV, HBase, Avro
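As an illustration (a sketch only — the table name and path are hypothetical, and it assumes a HiveContext hc as introduced on the next slide, which also provides the built-in ORC source in Spark 1.x):

val salesDF  = hc.table("default.sales")                   // DataFrame from an existing Hive table
val eventsDF = hc.read.format("orc").load("/data/events")  // DataFrame from ORC files on HDFS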
17 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
SQL	Context
SQLContext
→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext
val sqlContext = new SQLContext(sc)
HiveContext
→ Superset of the functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions → UDFs
→ Use when your data resides in Hive
val hc = new HiveContext(sc)
18 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	SQL	Examples
19 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Setting	up	DataFrame API
Create a DataFrame
val flightsDF = …   ← create from CSV, JSON, Hive, etc.
Example:
val path = "examples/flights.json"
val flightsDF = sqlContext.read.json(path)
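For CSV sources in Spark 1.x (a sketch, assuming the external spark-csv package is on the classpath and a hypothetical file path):

val flightsCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")  // external CSV data source for Spark 1.x
  .option("header", "true")            // first line holds the column names
  .option("inferSchema", "true")       // infer column types from the data
  .load("examples/flights.csv")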
20 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Setting	up	SQL	API
Register	a	Temporary	Table
flightsDF.registerTempTable("flights")
21 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Two	API	Examples:	DataFrame and	SQL	APIs
DataFrame API
flightsDF.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
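To run the SQL version programmatically (a minimal sketch), pass the query string to sqlContext.sql, which returns a DataFrame:

val delayedDF = sqlContext.sql(
  "SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5")
delayedDF.show()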
22 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
23 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What	is	Stream	Processing?
Batch	Processing
• Ability	to	process	and	analyze	data	at-rest	(stored	data)
• Request-based,	bulk	evaluation	and	short-lived	processing
• Enabler	for	Retrospective,	Reactive	and	On-demand	Analytics
Stream	Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event- or micro-batch-driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
24 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Modern Data Applications approach to Insights
Traditional Analytics
• Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics
• Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…
25 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Overview
→ Extension of the Spark Core API
→ Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
[Diagram: streaming input sources (e.g. ZeroMQ, MQTT) feeding Spark Streaming]
26 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
27 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Discretized Streams (DStreams)
→ High-level abstraction representing a continuous stream of data
→ Internally represented as a sequence of RDDs
→ Operations applied on a DStream translate to operations on the underlying RDDs
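A minimal sketch of the idea (the socket source host/port is hypothetical; in Spark 1.x a StreamingContext is created from the existing SparkContext):

import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String] from a text socket
val counts = lines.flatMap(_.split(" "))              // each micro-batch is processed as an RDD
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()                                        // print a few counts per batch
ssc.start()                                           // start receiving and processing data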
28 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Example:	flatMap operation
29 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	Streaming
Window Operations
→ Apply transformations over a sliding window of data, e.g. a rolling average
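For example (a sketch building on the word-count DStream above; the window and slide durations are illustrative):

// count words over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()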
30 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	MLlib
31 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Where Can We Use Machine Learning (Data Science)
Healthcare
• Predict	diagnosis
• Prioritize	screenings
• Reduce	re-admittance	rates
Financial	services
• Fraud	Detection/prevention
• Predict	underwriting	risk
• New	account	risk	screens
Public	Sector
• Analyze	public	sentiment
• Optimize	resource	allocation
• Law	enforcement	&	security	
Retail
• Product	recommendation
• Inventory	management
• Price	optimization
Telco/mobile
• Predict	customer	churn
• Predict	equipment	failure
• Customer	behavior	analysis
Oil	&	Gas
• Predictive	maintenance
• Seismic	data	management
• Predict	well	production	levels
32 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
...
...
33 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Linear Regression Model Training (one feature)
Training result: coefficient 2.81, intercept 3.05
Fitted model: y = 2.81x + 3.05
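A sketch of how such a model is trained with the Spark ML (DataFrame-based) API, assuming the scatterData DataFrame with label/features columns shown earlier:

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(100)                  // solver iteration limit
val lrModel = lr.fit(scatterData)   // train on the DataFrame
println(s"Coefficients: ${lrModel.coefficients}  Intercept: ${lrModel.intercept}")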
34 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
35 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark API for building ML pipelines
[Diagram: a Pipeline chains feature transform 1, feature transform 2, a combine-features stage, and Linear Regression; fitting it on an input DataFrame (Train) produces a Pipeline Model, which transforms new input DataFrames (Predict) into an output DataFrame and can be exported]
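A minimal sketch of such a pipeline (the column names and input DataFrames are hypothetical; a VectorAssembler plays the combine-features role):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

val indexer   = new StringIndexer().setInputCol("carrier").setOutputCol("carrierIndex")  // feature transform
val assembler = new VectorAssembler()                                                    // combine features
  .setInputCols(Array("carrierIndex", "distance"))
  .setOutputCol("features")
val lr = new LinearRegression().setLabelCol("DepDelay").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(trainingDF)         // Train → Pipeline Model
val predictions = model.transform(testDF)    // Predict → output DataFrame with a prediction column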
36 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	GraphX
37 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
GraphX
→ PageRank
→ Topic Modeling (LDA)
→ Community Detection
Source:	ampcamp.berkeley.edu
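A sketch of running PageRank with GraphX (the edge-list file is hypothetical; each line holds a source and destination vertex id):

import org.apache.spark.graphx.GraphLoader
val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")  // build the graph from an edge list
val ranks = graph.pageRank(0.0001).vertices                     // run PageRank to a convergence tolerance
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)  // top 5 vertices by rank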
38 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Apache	Zeppelin	&	HDP	Sandbox
39 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What’s Apache Zeppelin?
A web-based notebook that enables interactive data analytics.
You can make beautiful, data-driven, interactive and collaborative documents with SQL, Scala and more.
40 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in the browser
• Zeppelin sends the code to a backend for execution
• Zeppelin gets data back from the backend
• Zeppelin visualizes the data
• A Zeppelin Note = a set of paragraphs (cells)
• Other features: sharing, collaboration, reports, import/export
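Each paragraph begins with an interpreter binding; for example (a sketch that reuses the flights DataFrame and temp table from the Spark SQL slides):

%spark
// Scala paragraph, executed by the Spark interpreter
flightsDF.groupBy("Origin").count().show(5)

%sql
-- SQL paragraph; results render as Zeppelin tables and charts
SELECT Origin, count(*) AS flights FROM flights GROUP BY Origin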
41 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Big	Data	Lifecycle
Collect → ETL/Process → Analysis → Report → Data Product
Roles along the lifecycle: Data Engineer, Data Scientist, Business User, Customer
All in one place in Zeppelin!
42 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
How	does	Zeppelin	work?
[Diagram: a Notebook Author and Collaborators/Report viewers work through Zeppelin, which runs code on the cluster against Spark | Hive | HBase — any of 30+ back ends]
43 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
HDP	Sandbox
What’s	included	in	the	HDP	Sandbox?
→ Zeppelin
→ Spark
→ YARN → Resource Management
→ HDFS → Distributed Storage Layer
→ And many more components: Hive, Solr, etc.
[Diagram: Scala/Java/Python/R APIs over the Spark Core Engine (Spark SQL, Spark Streaming, MLlib, GraphX), running on YARN over HDFS across cluster nodes 1…N]
44 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Access patterns enabled by YARN
YARN: Data Operating System
HDFS: Hadoop Distributed File System (cluster nodes 1…N)
• Batch – needs to happen, but with no timeframe limitations
• Interactive – needs to happen at human time
• Real-Time – needs to happen at machine execution time
45 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Why	Apache	Spark	on	YARN?
→ Resource management
– Share Spark workloads with other workloads (Hive, Solr, etc.)
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues
[Diagram: the client-side Spark Driver talks to a Spark Application Master running in a YARN container, which manages multiple Spark Executors (each in its own YARN container) running tasks]
46 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing data locality
• Not just storage, but computation
[Diagram: a logical file is split into blocks 0–4; each block is replicated 3 times and spread across the nodes of the cluster]
47 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
There’s more to HDP
Hortonworks Data Platform 2.4.x
Data Management: HDFS (Hadoop Distributed File System) and YARN (Data Operating System) across cluster nodes 1…N
Data Access: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory, other ISV engines — on Tez and Slider
Governance & Integration: Data Lifecycle & Governance (Falcon, Atlas); Data Workflow (Sqoop, Flume, Kafka, NFS, WebHDFS)
Security: Administration, Authentication, Authorization, Auditing, Data Protection (Ranger, Knox, Atlas, HDFS Encryption)
Operations: Provisioning, Managing & Monitoring (Ambari, Cloudbreak, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premise, Cloud
48 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Hortonworks	Data	Cloud
49 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
50 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
51 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Bringing	Multitenancy	to	Apache	Zeppelin
52 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Introducing	Livy
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as a Spark Ambari service
[Diagram: a Livy client calls the Livy Server over HTTP; the Livy Server drives Spark interactive sessions and Spark batch sessions (each with its own SparkContext) over HTTP/RPC]
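As an illustration of the REST flow (a simplified sketch): POST /sessions with a body like {"kind": "spark"} creates an interactive session, POST /sessions/{id}/statements with {"code": "sc.version"} runs a snippet in it, and GET /sessions/{id}/statements polls for the results.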
53 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Security	Across	Zeppelin-Livy-Spark
[Diagram: Zeppelin (Shiro authentication backed by LDAP, Livy/Spark group interpreter) calls the Livy Server APIs via SPNEGO/Kerberos; the Livy Server launches the Spark driver on YARN using Kerberos]
54 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Reasons	to	Integrate	with	Livy
→ Bring sessions to Apache Zeppelin
– Isolation
– Session sharing
→ Enable efficient cluster resource utilization
– The default Spark interpreter keeps the YARN/Spark job running forever
– The Livy interpreter is recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity propagation
– Send user identity from Zeppelin → Livy → Spark on YARN
55 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
SparkContext Sharing
[Diagram: on the Livy Server, Clients 1 and 2 attach to Session-1 and Client 3 to Session-2; each session (SparkSession-1, SparkSession-2) is backed by its own SparkContext]
56 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Sample	Architecture
57 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Managed	Dataflow
[Diagram: data flows from SOURCES through REGIONAL INFRASTRUCTURE to CORE INFRASTRUCTURE]
58 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
High-Level	Overview
[Diagram: IoT devices send data to single-node IoT edge nodes, which forward it to a NiFi hub and data broker; a data center (on-prem/cloud) persists it to a data store (HDFS/S3) and a column DB (HBase/Cassandra) that feed a live dashboard]
59 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What’s	new	in	Spark	2.0
60 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Spark	2.0
→ API Improvements
– SparkSession (spark) – new entry point (replaces SQLContext and HiveContext)
– Unified DataFrame & Dataset API (DataFrame → alias for Dataset[Row])
– Structured Streaming / Continuous Applications (concept of an infinite DataFrame)
– Temporary Table → Temporary View
→ Performance Improvements
– Tungsten Phase 2 – whole-stage code generation
– ORC & Parquet file improvements
→ Machine Learning
– ML pipelines are the new API; the RDD-based MLlib API is deprecated
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
→ Spark SQL
– More SQL support (new ANSI SQL parser, subquery support)
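For example (a sketch of the new Spark 2.0 entry point; in a Spark 2.0 shell or Zeppelin the spark variable is pre-created):

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Flights")
  .enableHiveSupport()        // optional: the Hive access a HiveContext used to provide
  .getOrCreate()

val flightsDF = spark.read.json("examples/flights.json")
flightsDF.createOrReplaceTempView("flights")               // replaces registerTempTable
spark.sql("SELECT count(*) FROM flights").show()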
61 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
What’s	the	latest	at	Hortonworks?
→ HDP 2.5 – batch processing (data at rest)
→ HDF 2.0 – streaming apps (data in motion)
Modern Data Applications: data at rest + data in motion → actionable intelligence
62 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Lab	Preview
63 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Lab	Setup	Instructions
http://tinyurl.com/hwx-spark-intro
Lab	Options
- Local Sandbox (8 GB RAM required):
  - VirtualBox or VMware
- Amazon AWS Cloud:
  - Hortonworks Data Cloud
    → Setup info: http://hortonworks.github.io/hdp-aws/index.html
64 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Hortonworks	Community	Connection
65 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Community	Engagement
Participate now at: community.hortonworks.com
9,500+
Registered	Users
21,000+
Answers
32,500+
Technical	Assets
One Website!
66 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Hortonworks	Community	Connection
Read access for everyone, join to participate and be recognized
• Full	Q&A	Platform	(like	StackOverflow)
• Knowledge	Base	Articles
• Code	Samples	and	Repositories
Robert	Hryniewicz
E:	rhryniewicz@hortonworks.com
T:	@RobH8z
Thanks!
#HSTokyo16 Apache Spark Crash Course