Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics

Bin Fan | Founding Engineer @ Alluxio
Cloud-Data-Orchestration-Austin meetup
2019/08/15

Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver

The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by
Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
2015: Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
2018, 2019
2019 Top 10 Big Data
2019 Top 10 Cloud Software

Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio

Key Innovations
Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent tiered local data
Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, ML workloads on your data located anywhere
Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale
data on-demand with compute

Data Locality via Intelligent Multi-tiering
§ Local performance from remote data using multi-tier storage
RAM SSD HDD
Hot Warm Cold
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
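
The tiering behavior on this slide can be illustrated with a toy model. The sketch below is not Alluxio code: `TieredStore` and its methods are hypothetical names, capacities are counted in blocks rather than bytes, eviction is plain LRU, and TTL is left out. It only shows the idea of promoting a block to the fastest tier on access, demoting the least recently used block when a tier is full, and pinning a block so it is never demoted.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier block store (hypothetical, not the Alluxio API)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def _insert(self, level, block_id, data):
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)  # mark as most recently used
        while len(blocks) > cap:
            # Oldest unpinned block is the demotion victim.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; this toy model allows overflow
            victim_data = blocks.pop(victim)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim, victim_data)  # demote
            # Blocks falling off the last tier are simply dropped.

    def write(self, block_id, data):
        self._insert(0, block_id, data)

    def read(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote to the top tier
                return data
        return None

    def pin(self, block_id):
        self.pinned.add(block_id)

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


store = TieredStore([("RAM", 2), ("SSD", 2), ("HDD", 2)])
for block in ("b1", "b2", "b3"):
    store.write(block, block.encode())
print(store.tier_of("b1"))  # b1 was demoted when b3 arrived
store.read("b1")            # reading b1 promotes it again
print(store.tier_of("b1"))
```

In a real deployment these policies are configured on the Alluxio workers rather than written by the application, which is what "Transparent to App" refers to.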

Data Accessibility via popular APIs
Spark
> rdd = sc.textFile("alluxio://master:19998/myInput")
Presto
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
Bash
~$ cat /mnt/alluxio/myInput
TensorFlow
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

Data Abstraction via Unified Namespace
Enables effective data management across different Under Store
$ ./bin/alluxio fs mount /Data s3://bucket/directory
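
What the mount command sets up can be pictured as a table from Alluxio paths to under-store locations. The sketch below is a simplified illustration, not Alluxio's implementation: `MountTable` is a hypothetical class, the `hdfs://namenode:9000/alluxio` root is an invented example, and resolution is a plain longest-prefix match.

```python
class MountTable:
    """Toy unified namespace: Alluxio path prefix -> under-store URI."""

    def __init__(self):
        self._mounts = {}

    def mount(self, alluxio_path, ufs_uri):
        # Normalize so "/Data/" and "/Data" are the same mount point.
        key = alluxio_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path):
        """Translate an Alluxio path to the under-store URI backing it."""
        best = None
        for prefix in self._mounts:
            is_match = (alluxio_path == prefix
                        or alluxio_path.startswith(prefix.rstrip("/") + "/"))
            if is_match and (best is None or len(prefix) > len(best)):
                best = prefix  # keep the longest matching mount point
        if best is None:
            raise KeyError("no mount point covers " + alluxio_path)
        suffix = alluxio_path if best == "/" else alluxio_path[len(best):]
        return self._mounts[best] + suffix


table = MountTable()
table.mount("/", "hdfs://namenode:9000/alluxio")   # hypothetical root store
table.mount("/Data", "s3://bucket/directory")      # the mount from the slide
print(table.resolve("/Data/2019/part-0"))
print(table.resolve("/logs/app.log"))
```

Applications keep using one `alluxio://` namespace while the table decides which under store actually holds each path.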

Data stack journey and innovation paths
Co-located
Co-located compute & HDFS on the same cluster (MR / Hive on HDFS)
§ Typically compute-bound clusters over 100% capacity
§ Compute & I/O need to be scaled together even when not needed
Disaggregated
Disaggregated compute & HDFS on the same cluster (Hive on HDFS)
§ Compute & I/O can be scaled independently but I/O still needed on HDFS
which is expensive
HDFS for Hybrid Cloud
Burst HDFS data in the cloud, public or private
Support more frameworks
Support Presto, Spark and other computes without app changes
Transition to Object store
Enable & accelerate big data on object stores

Typical Use Cases
Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage

Deployment Approaches
Co-locate Alluxio Workers with Spark for
optimal I/O performance
[Diagram: Spark and Alluxio on the same instance / container; Storage in any cloud]
Deploy Alluxio as standalone cluster
between Spark and Storage
[Diagram: Spark, Presto and Alluxio in the same data center / region; Storage in any cloud]

Elastic Model Training
Leading Hedge Fund
Challenge –
Algorithmic trading in a $46B data-driven hedge fund. Model
training in cloud for bursty workloads
Data access was slow, costing them $$ in compute cost and
lower modeler productivity
Solution –
With Alluxio, data access is 10-30X faster
Impact –
Increased efficiency in training ML algorithms, lowered compute cost
and increased modeler productivity,
resulting in a 14-day ROI on Alluxio
[Diagram: SPARK, HDFS, Public Cloud]

Machine Learning Case Study
Challenge –
Gain end to end view of business
with large volume of data
Queries were slow / not interactive,
resulting in operational inefficiency
Solution –
ETL Data from Teradata to Alluxio
Impact –
Faster Time to Market – "Now we
don't have to work Sundays"
Use Case: http://bit.ly/2oMx95W
[Diagram: SPARK, TERADATA]

Analytics Use Case – Top Retailer
Challenge –
Bottleneck in Trend Analysis of
mission critical daily sales and
inventory management
Queries were slow / not interactive,
resulting in operational inefficiency
Solution –
With Alluxio, data queries are 10X
faster
Impact –
Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
[Diagram: SPARK, HDFS]

Alluxio Reference Architecture
[Diagram components: Applications; Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Workers; Under Store 1, Under Store 2]

Read data in Alluxio, on same node as client
Memory Speed Read of Data
[Diagram: Application with Alluxio Client; Alluxio Master; Alluxio Worker (RAM / SSD / HDD)]

Read data not in Alluxio
Network / Disk Speed Read of Data
[Diagram: Application with Alluxio Client; Alluxio Master; Alluxio Worker (RAM / SSD / HDD); Under Store]
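
The two read cases above can be sketched as a small simulation. This is not Alluxio client code: `Worker` and `read_block` are hypothetical names, the under store is a plain dictionary, and the sketch assumes the worker keeps a copy of a block after fetching it from the under store so that the next read is local.

```python
class Worker:
    """Toy Alluxio worker holding cached blocks on the client's node."""

    def __init__(self):
        self.blocks = {}


def read_block(block_id, local_worker, under_store):
    """Return (data, source), where source names where the bytes came from."""
    if block_id in local_worker.blocks:
        # Case 1: block is already in Alluxio on this node (memory-speed read).
        return local_worker.blocks[block_id], "alluxio"
    # Case 2: block is not in Alluxio, so it is fetched from the under store
    # at network / disk speed.
    data = under_store[block_id]
    local_worker.blocks[block_id] = data  # assumption: cache it for next time
    return data, "under_store"


under_store = {"block1": b"cold data"}
worker = Worker()
print(read_block("block1", worker, under_store))  # first read is a miss
print(read_block("block1", worker, under_store))  # second read is served locally
```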

Write data only to Alluxio on same node as client
Memory Speed Write of Data
[Diagram: Application with Alluxio Client; Alluxio Master; Alluxio Worker (RAM / SSD / HDD)]

Write data to Alluxio and Under Store synchronously
Network / Disk Speed Write of Data
[Diagram: Application with Alluxio Client; Alluxio Master; Alluxio Worker (RAM / SSD / HDD); Under Store]
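
The trade-off between the two write modes can be sketched the same way. The names below (`WriteMode`, `write_block`) are hypothetical and chosen for this illustration. They seem to correspond to the write types that the Alluxio documentation calls MUST_CACHE and CACHE_THROUGH, but that mapping is an assumption here, not something stated on the slides.

```python
from enum import Enum


class WriteMode(Enum):
    ALLUXIO_ONLY = "alluxio_only"   # fast, but not persisted
    SYNC_THROUGH = "sync_through"   # slower, persisted before returning


def write_block(block_id, data, worker_blocks, under_store, mode):
    """Write a block to the local worker and, optionally, the under store."""
    worker_blocks[block_id] = data          # memory-speed write
    if mode is WriteMode.SYNC_THROUGH:
        under_store[block_id] = data        # network / disk speed write
    # Tell the caller whether the block would survive losing the worker.
    return block_id in under_store


worker_blocks, under_store = {}, {}
print(write_block("tmp", b"scratch", worker_blocks, under_store,
                  WriteMode.ALLUXIO_ONLY))
print(write_block("result", b"final", worker_blocks, under_store,
                  WriteMode.SYNC_THROUGH))
```

Writing only to Alluxio suits temporary or easily recomputed data. The synchronous mode pays the under-store latency in exchange for durability.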

Accessing Alluxio Data From Spark
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file

Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)

Code Example for Spark DataFrames
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio (spark is the SparkSession; SparkContext has no read attribute)
df = spark.read.parquet(alluxioPath)
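
Put together, the RDD and DataFrame calls might be wrapped as follows in PySpark. This is a minimal sketch under several assumptions: the helper names (`alluxio_uri`, `rdd_roundtrip`, `dataframe_roundtrip`) are invented for illustration, the host `master` and port `19998` are taken from the URIs shown earlier in the deck, and the Spark functions only work against a cluster where the Alluxio client jar is on the Spark classpath. The Spark objects are passed in, so the block itself does not import pyspark.

```python
def alluxio_uri(path, master="master", port=19998):
    """Build an alluxio:// URI like the ones shown in the deck."""
    return "alluxio://{}:{}/{}".format(master, port, path.lstrip("/"))


def rdd_roundtrip(sc, lines, path):
    """Write a list of strings as a text file, then read it back.

    sc is an existing SparkContext. saveAsTextFile fails if path exists.
    """
    sc.parallelize(lines).saveAsTextFile(path)
    return sc.textFile(path).collect()


def dataframe_roundtrip(spark, rows, columns, path):
    """Write rows as Parquet, then read them back as a DataFrame.

    spark is an existing SparkSession (not a SparkContext).
    """
    spark.createDataFrame(rows, columns).write.parquet(path)
    return spark.read.parquet(path)


# Example usage on a cluster (left commented out because it needs Spark
# and a running Alluxio master):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("alluxio-demo").getOrCreate()
#   rdd_roundtrip(spark.sparkContext, ["a", "b"], alluxio_uri("/demo/text"))
#   dataframe_roundtrip(spark, [(1, "a")], ["id", "value"],
#                       alluxio_uri("/demo/parquet")).show()
print(alluxio_uri("/myInput"))
```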

Sharing Data via Memory
Storage Engine & Execution Engine: Same Process
• Two copies of data in memory – double the memory used
• Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark jobs, each with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holding blocks 1-4]

Sharing Data via Memory
Storage Engine & Execution Engine: Different process
• Half the memory used
• Sharing Data at Memory Speed
[Diagram: two Spark jobs (Spark Compute, Spark Storage) sharing Alluxio, which holds block 1, block 3, block 4; HDFS / Amazon S3 and HDFS disk holding blocks 1-4]

Data Resilience During Crash
Storage Engine & Execution Engine: Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: Spark Compute and Spark Storage (block 1, block 3) over HDFS / Amazon S3 (blocks 1-4); after the crash only the HDFS / Amazon S3 copies remain]

Storage Engine & Execution Engine: Different process
• Process Crash – Data is Re-read at Memory Speed
[Diagram: Spark Compute and Spark Storage over Alluxio (block 1, block 3, block 4), with HDFS / Amazon S3 and HDFS disk (blocks 1-4); after the crash the blocks in Alluxio are still available]
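
The difference between the two crash scenarios can be shown with a toy model that counts slow reads. All names here (`SlowStore`, `ComputeProcess`, `slow_reads_after_crash`) are hypothetical. A "crash" is simulated by discarding the compute object, and the external cache is a dictionary standing in for Alluxio.

```python
class SlowStore:
    """Stands in for HDFS / Amazon S3 and counts how often it is read."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0

    def read(self, block_id):
        self.reads += 1
        return self.blocks[block_id]


class ComputeProcess:
    """Toy compute process with either an in-process or an external cache."""

    def __init__(self, store, shared_cache=None):
        self.store = store
        # With no shared cache, cached blocks live (and die) with the process.
        self.cache = shared_cache if shared_cache is not None else {}

    def get(self, block_id):
        if block_id not in self.cache:
            self.cache[block_id] = self.store.read(block_id)
        return self.cache[block_id]


def slow_reads_after_crash(use_external_cache):
    store = SlowStore({"block1": b"a", "block3": b"c"})
    external = {} if use_external_cache else None
    process = ComputeProcess(store, external)
    for block_id in store.blocks:
        process.get(block_id)              # warm the cache
    process = ComputeProcess(store, external)  # crash and restart
    before = store.reads
    for block_id in store.blocks:
        process.get(block_id)
    return store.reads - before


print(slow_reads_after_crash(use_external_cache=False))
print(slow_reads_after_crash(use_external_cache=True))
```

With the in-process cache every block has to be fetched from the slow store again after the restart. With the external cache the restarted process finds the blocks still cached.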

Performance Tuning Tips
§ Data Locality
  § Ensure Spark Executor Locality
  § Ensure Spark Task Locality
  § Prioritize Locality
§ Load Balancing
  § Smaller Block Size
  § Tune Executor Number
  § DeterministicHashPolicy to load data from UFS
Read more: https://dzone.com/articles/top-10-tips-for-making-the-spark-alluxio-stack-bla
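
The idea behind the deterministic hashing tip can be sketched as follows: if every client maps a given block to the same worker (or the same small set of workers), a cold block is loaded from the UFS by those workers only, instead of by whichever worker each client happens to pick. The function below is an illustration of that idea, not Alluxio's actual policy code. `pick_workers`, the `shards` parameter name, and the use of MD5 are all choices made for this sketch.

```python
import hashlib


def pick_workers(block_id, workers, shards=1):
    """Deterministically map a block to `shards` candidate workers."""
    ordered = sorted(workers)  # same order no matter how the caller lists them
    digest = hashlib.md5(str(block_id).encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(ordered)
    count = min(shards, len(ordered))
    # Take `count` consecutive workers starting at the hashed position.
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


workers = ["worker-a", "worker-b", "worker-c", "worker-d"]
# Two clients that see the workers in a different order still agree.
print(pick_workers(42, workers) == pick_workers(42, list(reversed(workers))))
print(len(pick_workers(42, workers, shards=2)))
```

Using more than one shard spreads reads of a very hot block over several workers, at the cost of loading it from the UFS more than once.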

Alluxio 2.0 & Coming in 2.1 Release
§ Alluxio 2.0: Released in July
  § Metadata scales to 1 billion files or more (based on RocksDB)
  § Self-managed metadata service based on quorum
  § Async writes, distributed load
  § Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
§ Alluxio 2.1: Scheduled for Sept
  § A Presto-Alluxio Connector with Iceberg Integration
  § Use Alluxio as a caching layer without modifying HMS

Next steps - Try it out!
• Getting Started
• Try 10 Minutes Alluxio & Presto Tutorial on Laptop
• Try 10 Minutes Alluxio & Presto Tutorial on AWS
• Spark and Alluxio in 5 minutes
• Top 5 Performance tips for running Presto on Alluxio
Questions or Suggestions? Engage with us at alluxio.io/slack!

Questions
Slides will be available in the Slack channel (https://alluxio.io/slack)
