Getting started with Apache Spark and Alluxio for
blazingly fast analytics
Bin Fan | Founding Engineer @ Alluxio
Cloud-Data-Orchestration-Austin meetup
2019/08/15
Alluxio Overview
Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver
The Alluxio Story
2013
Originated as the Tachyon project at UC Berkeley AMPLab by
Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
2018
2019
2019 Top 10 Big Data
2019 Top 10 Cloud Software
Fast-growing Open Source Community
4000+ GitHub Stars, 1000+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
Key Innovations
Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent tiered local data
Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, ML workloads on your data located anywhere
Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale data on-demand with compute
Data Locality via Intelligent Multi-tiering
§ Local performance from remote data using multi-tier storage
RAM SSD HDD
Hot Warm Cold
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
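The tiering policies above can be illustrated with a toy model. This is a minimal sketch, not Alluxio's implementation: the class `TieredCache`, its method names, and the LRU-style demotion rule are all assumptions made for illustration.

```python
import time
from collections import OrderedDict

TIERS = ["RAM", "SSD", "HDD"]  # hot, warm, cold


class TieredCache:
    """Toy multi-tier block cache (hypothetical; not Alluxio code)."""

    def __init__(self, capacity_per_tier):
        # Maximum number of blocks each tier may hold.
        self.capacity = dict(capacity_per_tier)
        # One ordered dict per tier: block_id -> (data, expiry or None).
        # Oldest entries come first, so iteration order is LRU order.
        self.tiers = {name: OrderedDict() for name in TIERS}
        self.pinned = set()

    def put(self, block_id, data, ttl=None, now=None):
        """Write a block into the top tier, optionally with a TTL in seconds."""
        now = time.time() if now is None else now
        expiry = None if ttl is None else now + ttl
        self._remove(block_id)
        self._insert(0, block_id, (data, expiry))

    def pin(self, block_id):
        """Pinned blocks are never chosen for demotion."""
        self.pinned.add(block_id)

    def get(self, block_id, now=None):
        """Read a block; a hit promotes it back to the top tier."""
        now = time.time() if now is None else now
        for name in TIERS:
            entry = self.tiers[name].get(block_id)
            if entry is None:
                continue
            data, expiry = entry
            if expiry is not None and now >= expiry:
                self._remove(block_id)  # TTL elapsed: drop the block
                return None
            del self.tiers[name][block_id]
            self._insert(0, block_id, entry)  # promotion
            return data
        return None

    def tier_of(self, block_id):
        for name in TIERS:
            if block_id in self.tiers[name]:
                return name
        return None

    def _remove(self, block_id):
        for name in TIERS:
            self.tiers[name].pop(block_id, None)

    def _insert(self, level, block_id, entry):
        if level >= len(TIERS):
            return  # fell off the coldest tier: evicted
        tier = self.tiers[TIERS[level]]
        tier[block_id] = entry
        while len(tier) > self.capacity[TIERS[level]]:
            # Demotion: move the least recently used unpinned block down.
            victim = next((b for b in tier if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; this toy model allows overflow
            self._insert(level + 1, victim, tier.pop(victim))


cache = TieredCache({"RAM": 1, "SSD": 1, "HDD": 1})
cache.put("block-1", b"hot data")
cache.put("block-2", b"hotter data")  # block-1 is demoted to SSD
cache.get("block-1")                  # block-1 is promoted, block-2 demoted
```

The application only calls `put` and `get`; which tier serves the block is decided by the cache, which is the sense in which tiering is transparent to the app.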
Data Accessibility via popular APIs
Spark
> rdd = sc.textFile("alluxio://master:19998/myInput")
Presto
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
Bash
~$ cat /mnt/alluxio/myInput
Tensorflow
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
Data Abstraction via Unified Namespace
Enables effective data management across different under stores
$ ./bin/alluxio fs mount /Data s3://bucket/directory
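One way to picture the unified namespace is as a mount table that maps a path in Alluxio to a location in an under store. The sketch below is a simplified model under stated assumptions: the class `MountTable` and the root under store address `hdfs://namenode:9000/alluxio` are hypothetical; only the `/Data` to `s3://bucket/directory` mapping comes from the mount command above.

```python
class MountTable:
    """Toy mount table: resolves namespace paths to under store URIs."""

    def __init__(self, root_ufs):
        # The root mount covers every path that no other mount point claims.
        self.mounts = {"/": root_ufs.rstrip("/")}

    def mount(self, alluxio_path, ufs_uri):
        """Attach an under store location at a path in the namespace."""
        point = alluxio_path.rstrip("/") or "/"
        self.mounts[point] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path):
        """Return the under store URI backing the given namespace path."""
        best = "/"
        for point in self.mounts:
            if point == "/":
                continue
            # Match whole path components only, so /Data does not
            # accidentally claim /Database.
            matches = alluxio_path == point or alluxio_path.startswith(point + "/")
            if matches and len(point) > len(best):
                best = point  # the longest matching mount point wins
        suffix = alluxio_path if best == "/" else alluxio_path[len(best):]
        suffix = suffix.strip("/")
        return self.mounts[best] + ("/" + suffix if suffix else "")


# Hypothetical root under store; the /Data mount mirrors the command above.
table = MountTable("hdfs://namenode:9000/alluxio")
table.mount("/Data", "s3://bucket/directory")
print(table.resolve("/Data/2019/events.parquet"))
print(table.resolve("/logs/app.log"))
```

Applications keep using one path scheme while the data stays in whichever under store backs that part of the namespace.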
Production Use Cases
Data stack journey and innovation paths
Co-located: compute & HDFS on the same cluster (MR / Hive on HDFS)
§ Typically compute-bound clusters over 100% capacity
§ Compute & I/O need to be scaled together even when not needed
Disaggregated: compute & HDFS on separate clusters (Hive on HDFS)
§ Compute & I/O can be scaled independently but I/O still needed on HDFS which is expensive
Innovation paths
§ HDFS for Hybrid Cloud: burst HDFS data in the cloud, public or private
§ Support more frameworks: support Presto, Spark and other computes without app changes
§ Transition to Object store: enable & accelerate big data on object stores
Typical Use Cases
Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Deployment Approaches
Co-locate Alluxio Workers with Spark for optimal I/O performance
Spark and Alluxio on the same instance / container; Storage in any cloud
Deploy Alluxio as standalone cluster between Spark and Storage
Spark or Presto and Alluxio in the same data center / region; Storage in any cloud
Elastic Model Training
SPARK
HDFS
SPARK
HDFS
Challenge –
Algorithmic trading in $46B data
driven Hedge Fund. Model
training in cloud for bursty
workloads
Data access was slow, costing
them $$ in compute cost and
lower modeler productivity
Solution –
With Alluxio, data access is 10-30X faster
Impact –
Increased efficiency on training of
ML algorithm, lowered compute cost
and increased modeler productivity,
resulting in 14 day ROI of Alluxio
Public Cloud
Public Cloud
Leading Hedge Fund
Machine Learning Case Study
Challenge –
Gain end to end view of business
with large volume of data
Queries were slow / not interactive,
resulting in operational inefficiency
Solution –
ETL Data from Teradata to Alluxio
Impact –
Faster Time to Market – “Now we
don’t have to work Sundays”
Use Case: http://bit.ly/2oMx95W
SPARK
TERADATA
SPARK
TERADATA
Analytics Use Case – Top Retailer
Challenge –
Bottleneck in Trend Analysis of
mission critical daily sales and
inventory management
Queries were slow / not interactive,
resulting in operational inefficiency
Solution –
With Alluxio, data queries are 10X
faster
Impact –
Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
SPARK
HDFS
SPARK
HDFS
Alluxio Reference Architecture
Application … Application
Alluxio Master, Standby Master (Zookeeper / RAFT)
Alluxio Worker … Alluxio Worker
Under Store 1, Under Store 2
Read data in Alluxio, on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Read data not in Alluxio
RAM / SSD / HDD
Network / Disk Speed Read of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
Write data only to Alluxio on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
Write data to Alluxio and Under Store synchronously
RAM / SSD / HDD
Network / Disk Speed Write of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
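The four read and write paths above can be sketched as a small simulation. This is an illustrative model only: `ToyWorker` is hypothetical, a plain dict stands in for the under store, and the labels `MUST_CACHE` and `CACHE_THROUGH` are borrowed from Alluxio's write type names, which the slides themselves do not mention.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"        # write only to Alluxio
    CACHE_THROUGH = "CACHE_THROUGH"  # write to Alluxio and the under store


class ToyWorker:
    """Toy worker: a local cache in front of an under store (not Alluxio code)."""

    def __init__(self, under_store):
        self.cache = {}                 # stands in for RAM / SSD / HDD on the worker
        self.under_store = under_store  # dict standing in for HDFS, S3, ...

    def write(self, path, data, write_type):
        self.cache[path] = data  # memory speed write
        if write_type is WriteType.CACHE_THROUGH:
            # Synchronous persist: the write is only as fast as the under store.
            self.under_store[path] = data

    def read(self, path):
        """Return (data, source), where source tells which layer served the read."""
        if path in self.cache:
            return self.cache[path], "alluxio"  # memory speed read
        data = self.under_store[path]           # network / disk speed read
        self.cache[path] = data                 # later reads are served locally
        return data, "under_store"


under_store = {"/cold": b"persisted earlier"}
worker = ToyWorker(under_store)
print(worker.read("/cold")[1])  # first read comes from the under store
print(worker.read("/cold")[1])  # second read is served from Alluxio
worker.write("/scratch", b"temp", WriteType.MUST_CACHE)
worker.write("/result", b"final", WriteType.CACHE_THROUGH)
```

The trade-off is durability against speed: data written only to Alluxio is fast but absent from the under store, while the synchronous path pays the under store's latency on every write.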
Using Spark with Alluxio
Accessing Alluxio Data From Spark
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file
Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
Code Example for Spark DataFrames
Writing to Alluxio: df.write.parquet(alluxioPath)
Reading from Alluxio: df = spark.read.parquet(alluxioPath)
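The `alluxioPath` used in these examples is an ordinary URI string. A minimal sketch of how it might be built and used from PySpark follows; the helper names `alluxio_uri` and `roundtrip_parquet` are hypothetical, the default `master:19998` address is taken from the earlier Spark example, and the second function assumes a running SparkSession with the Alluxio client jar on the Spark classpath.

```python
def alluxio_uri(path, host="master", port=19998):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{host}:{port}{path}"


def roundtrip_parquet(spark, df, path):
    """Write a DataFrame to Alluxio as Parquet and read it back.

    `spark` is an existing SparkSession and `df` a DataFrame created from it.
    Nothing here is Alluxio-specific beyond the URI scheme.
    """
    uri = alluxio_uri(path)
    df.write.mode("overwrite").parquet(uri)
    return spark.read.parquet(uri)


print(alluxio_uri("/myInput"))
```

Because only the URI changes, existing Spark jobs that read from `hdfs://` or `s3://` paths can typically be pointed at Alluxio by swapping the path string.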
Sharing Data via Memory
Storage Engine & Execution Engine in the Same Process
• Two copies of data in memory: double the memory used
• Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark jobs (Spark Compute + Spark Storage), each holding block 1 and block 3 in memory; HDFS / Amazon S3 holds blocks 1 to 4]
Sharing Data via Memory
Storage Engine & Execution Engine in Different Processes
• Half the memory used
• Sharing Data at Memory Speed
[Diagram: two Spark jobs (Spark Compute + Spark Storage) share blocks held in Alluxio (block 1, block 3, block 4); HDFS / Amazon S3 (HDFS disk) holds blocks 1 to 4]
Data Resilience During Crash
Storage Engine & Execution Engine in the Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: Spark Compute and Spark Storage hold block 1 and block 3 in one process; after the crash the in-memory blocks are gone and only HDFS / Amazon S3 holds blocks 1 to 4]
Data Resilience During Crash
Storage Engine & Execution Engine in Different Processes
• Process Crash: Data is Re-read at Memory Speed
[Diagram: Alluxio holds block 1, block 3, block 4 outside the Spark process; after the Spark crash the blocks remain in Alluxio; HDFS / Amazon S3 (HDFS disk) holds blocks 1 to 4]
Performance Tuning Tips
§ Data Locality
§ Ensure Spark Executor Locality
§ Ensure Spark Task Locality
§ Prioritize Locality
§ Load Balancing
§ Smaller Block Size
§ Tune Executor Number
§ DeterministicHashPolicy to load data from UFS
Read more https://dzone.com/articles/top-10-tips-for-making-the-spark-alluxio-stack-bla
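The idea behind a deterministic hashing policy for loading from the UFS can be sketched as follows: every client maps the same block to the same small set of workers, so a cold block is fetched from the under store by a few workers rather than by all of them. This is a simplified illustration; the function `pick_workers`, the MD5-based hash, and the `shards` parameter as modeled here are assumptions, and Alluxio's own policy may hash and shard differently.

```python
import hashlib


def pick_workers(block_id, workers, shards=1):
    """Deterministically choose up to `shards` workers for a block."""
    # Sort so every client sees the workers in the same order.
    ordered = sorted(workers)
    # Use a stable hash; Python's built-in hash() varies between processes.
    digest = hashlib.md5(str(block_id).encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    count = min(shards, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


workers = ["worker-a", "worker-b", "worker-c"]  # hypothetical worker names
print(pick_workers(42, workers))
print(pick_workers(42, workers, shards=2))
```

A larger `shards` value spreads reads of a hot block across more workers at the cost of more copies loaded from the under store, which is the load balancing trade-off the tips above point at.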
Alluxio 2.0 & Coming in 2.1 Release
§ Alluxio 2.0: Released in July
§ Metadata scales to 1 billion files or more (based on RocksDB)
§ Self-managed metadata service based on quorum
§ Async writes, distributed load
§ Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
§ Alluxio 2.1: Scheduled in Sept
§ A Presto-Alluxio Connector with Iceberg Integration
§ Use Alluxio as a caching layer without modifying HMS
Next steps - Try it out!
• Getting Started
• Try the 10-minute Alluxio & Presto tutorial on a laptop
• Try the 10-minute Alluxio & Presto tutorial on AWS
• Spark and Alluxio in 5 minutes
• Top 5 performance tips for running Presto on Alluxio
Questions or Suggestions? Engage with us at alluxio.io/slack!
Questions
Slides will be available in the Slack channel (https://alluxio.io/slack)
