Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x performance improvement to Qunar’s streaming processing
Yupeng Fu, Alluxio Inc.
Xueyan Li, Qunar Inc.
May 2017
ABOUT US
• Yupeng Fu (yupeng9@github)
• Software Engineer @ Alluxio, Inc.
• Alluxio PMC member
• Worked at Palantir, Google
• Xueyan Li (astralidea@github)
• Software Engineer @ Qunar, Inc.
• Alluxio contributor
HISTORY
• Started at UC Berkeley AMPLab in summer 2012
• Originally named Tachyon
• Rebranded to Alluxio in early 2016
• Open-sourced in 2013
• Apache License 2.0
• Latest Stable Release: Alluxio 1.4.0 (Jan 2017)
• Alluxio 1.5.0-RC1 just cut
FASTEST-GROWING BIG DATA PROJECT
• Fastest-growing open-source project in the big data ecosystem
• 500+ contributors from 100+ organizations
• Running the world’s largest production clusters
DATA ECOSYSTEM YESTERDAY
• One Compute Framework
• Single Storage System
• Co-located
DATA ECOSYSTEM TODAY
• Many Compute Frameworks
• Multiple Storage Systems
• Most not co-located
DATA ECOSYSTEM ISSUES
• Each application manages multiple data sources
• Adding/removing data sources requires application changes
• Storage optimizations require application changes
• Lower performance due to lack of locality
DATA ECOSYSTEM WITH ALLUXIO
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance in Memory
• No Lock-in
Application-facing interfaces:
• Native File System
• Hadoop Compatible File System
• Native Key-Value Interface
• FUSE Compatible File System
• …
Storage-facing interfaces:
• HDFS Interface
• Amazon S3 Interface
• Swift Interface
• GlusterFS Interface
• …
WHY ALLUXIO
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
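To make the "unified namespace" point concrete, here is a minimal sketch of how a single namespace can be mapped onto several storage systems through a mount table. It is an illustration only, not Alluxio's implementation: the `MountTable` class, the host name `namenode-a`, and the bucket `example-bucket` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Toy mount table: maps paths in one namespace to under-storage URIs."""
    mounts: dict = field(default_factory=dict)

    def mount(self, namespace_path: str, storage_uri: str) -> None:
        # Normalize so "/logs/" and "/logs" refer to the same mount point.
        self.mounts[namespace_path.rstrip("/") or "/"] = storage_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Pick the longest mount point that is a prefix of the path.
        best = None
        for point in self.mounts:
            prefix = point if point.endswith("/") else point + "/"
            if path == point or path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self.mounts[best]
        return f"{base}/{suffix}" if suffix else base


# Hypothetical layout: an HDFS cluster at the root, an S3 bucket under /logs.
table = MountTable()
table.mount("/", "hdfs://namenode-a:8020/warehouse")
table.mount("/logs", "s3://example-bucket/logs")

print(table.resolve("/logs/2017/05/01.gz"))
print(table.resolve("/tables/prices"))
```

The application only ever sees paths such as `/logs/...`; which storage system backs a path is decided by the mount table, so adding or removing a store does not require application changes.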
ALLUXIO BENEFITS
• Unification: new workflows across any data in any storage system
• Performance: orders of magnitude improvement in run time
• Flexibility: choice in compute and storage; grow each independently, buy only what is needed
ALLUXIO DEPLOYMENTS
ALLUXIO USE CASES
On-Demand Analytics & Accelerating I/O to and from remote storage
Managing data across disparate storage systems
Sharing data across workloads at memory speed
Unify Data Analytics on Data from Geo-distributed Stores
One of Top 3 IT Vendors
ON-DEMAND ANALYTICS & ACCELERATE I/O TO/FROM REMOTE STORAGE
“The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.”
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• By using Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
PMs run interactive queries to gain insights into their products & business
• 200+ node deployment
• 2+ petabytes of storage
• Mix of memory + HDD
[Diagram labels: ALLUXIO, Baidu File System]
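The two pairs of timings on this slide imply different ratios, which a quick calculation makes explicit. This is a sketch under one assumption: the endpoints of the quoted ranges are matched (100 s with 10 s, 150 s with 15 s). The slide does not say which measurement the 30x headline is derived from.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_seconds / after_seconds


# Quote: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
# Assumption: matched endpoints of the two ranges.
quote_ratio = speedup(100, 10)

# Results bullet: over 15 minutes down to under 30 seconds.
batch_ratio = speedup(15 * 60, 30)

print(f"quoted query timings: {quote_ratio:.0f}x")
print(f"batch to interactive: {batch_ratio:.0f}x")
```

Under that assumption the quoted query timings correspond to roughly 10x, while the batch-to-interactive figure corresponds to at least 30x, since "over 15 minutes" and "less than 30 seconds" are both bounds.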
SHARE DATA ACROSS JOBS @ MEMORY SPEED
“Thanks to Alluxio, we now have the raw data immediately available at every iteration & can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.”
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Barclays uses queries & machine learning to train models for risk management
• 6-node deployment
• 1TB of storage
• Memory only
[Diagram labels: ALLUXIO, Relational Database: Teradata]
ONE OF TOP 3 IT VENDORS: UNIFY DATA ANALYTICS ON DATA FROM GEO-DISTRIBUTED STORES
“Alluxio has enabled us to get valuable insights into all our data as opposed to just a subset.”
- VP of Analytics
RESULTS
• Alluxio Unified Global Namespace enabled access to data from stores in different data centers without the need for ETL
• Enabled business insights that were otherwise not possible due to ETL restrictions on data
Analysts at a major global IT company run analytics on worldwide data
• 10 data centers across different geo-regions in the world: North America, Europe, and Asia
[Diagram labels: ALLUXIO, Europe, Asia]
MANAGE DATA ACROSS STORAGE SYSTEMS
“We’ve been running in production for over 1 year. Alluxio’s enabled different applications & frameworks to easily interact with data from different storage systems.”
RESULTS
• Spark Streaming, Spark batch and Flink jobs share data efficiently
• Improved the performance of their system with 15x to 300x speedups
• Tiered storage feature manages storage resources including memory, SSD and disk
Qunar uses real-time machine learning for their website ads
• 200+ node deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
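For a sense of scale, the daily figures above can be turned into per-second averages. This is a back-of-the-envelope sketch with two assumptions that are not stated on the slide: 1 TB is taken as 10^12 bytes, and the load is spread evenly over the day (real traffic peaks will be higher).

```python
SECONDS_PER_DAY = 24 * 60 * 60

logs_per_day = 6_000_000_000   # "6 billion logs" from the slide
bytes_per_day = 4.5e12         # "4.5 TB", assuming 1 TB = 10**12 bytes

avg_log_bytes = bytes_per_day / logs_per_day
avg_logs_per_second = logs_per_day / SECONDS_PER_DAY
avg_mb_per_second = bytes_per_day / SECONDS_PER_DAY / 1e6

print(f"average log size:   {avg_log_bytes:.0f} bytes")
print(f"average log rate:   {avg_logs_per_second:,.0f} logs/s")
print(f"average throughput: {avg_mb_per_second:.0f} MB/s")
```

Under these assumptions the averages come to about 750 bytes per log, roughly 69,000 logs per second, and about 52 MB/s of sustained ingest.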
About Qunar
• Raw message: 4000 QPS
• Sensitive data: price data
• Daily data volume: 4 TB
• After compression: 500 GB
• Leading travel search engine and information provider in China
• 75 million monthly visitors and 34 million activated mobile app users
• Real-time data processing platform
• Alluxio in production over a year
Hotel Quotation Pricing System
Data pipeline stages:
• Raw data collection
• Data extraction and cleaning
• Data compression and conversion
• Pricing info
Consumers and uses:
• Analyst/PM/Operations
• Business products
• Direct queries
• Price center monitor
• Real-time / off-line model training
Platform Architecture - Before
Issues
• Slow remote HDFS
• Repetitive disk reads
• Spark executor restarts
• Spark GC and OOM
Improving the Architecture with Alluxio
Platform Architecture in a Nutshell
• Compute
• Storage
• Resource Manager
• Storage systems: HDFS, HDFS, Ceph
Benefits with Alluxio
• Unified namespace: unifies the HDFS clusters and other storage systems.
• Data sharing among compute frameworks: Zeppelin, Flink, Spark, and MapReduce can share data at memory speed.
• Tiered storage: manages the local storage, including memory, SSD, and disk, as a hierarchical storage layer.
• Simple API and easy integration: write an app once and work with multiple storage systems.
• Spark off-heap storage: reduces GC overhead, and when a Spark executor fails and exits, the computed data is not lost to the "drifting" of the executor.
Benefits with Alluxio
• On average 15x faster!
• 300x faster at peak time!
Alluxio Inc.
• Team consists of Alluxio creators and top committers
• Invested by
• Committed to Alluxio Open Source
• http://www.alluxio.com
We are hiring!
Contact: yupeng@alluxio.com
Twitter: @Alluxio
Websites: www.alluxio.com and www.alluxio.org
Thank you!
Demo:
• Spark + Alluxio + S3: https://youtu.be/QVtxDpA-jis
• Alluxio Unified Namespace: https://youtu.be/lIXpNK4VxqE
Tiered storage separates cold and hot data
Tiers: MEM, SSD, HDD
Most hot data is used only for the current day's results.
We deployed an Alluxio worker on each compute node and used it to manage the local storage media, including memory, SSDs, and disks, as a hierarchical storage tier. Data related to each node's upstream computation is stored locally as much as possible to avoid consuming network resources. At the same time, Alluxio provides LRU, LFU, and other efficient replacement policies to keep hot data in the faster memory layer and improve the data access rate; even cold data is stored on the local disk, avoiding having to access …
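The hot/cold separation described above can be sketched as a small tiered block store with LRU demotion. This is a toy model under stated assumptions, not Alluxio's evictor interface: the `TieredStore` class, the per-tier capacities (counted in blocks), and the promote-on-read behavior are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered block store: new and recently read blocks live in the top
    tier; the least recently used block is demoted when a tier overflows."""

    def __init__(self, capacities):
        # capacities: mapping of tier name -> max number of blocks,
        # fastest tier first, e.g. {"MEM": 2, "SSD": 2, "HDD": 2}.
        self.capacities = dict(capacities)
        self.order = list(self.capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, level, block_id, data):
        name = self.order[level]
        tier = self.tiers[name]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        if len(tier) > self.capacities[name]:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            if level + 1 < len(self.order):
                self._insert(level + 1, victim_id, victim)
            # Blocks pushed out of the last tier are dropped in this sketch.

    def put(self, block_id, data):
        for tier in self.tiers.values():
            tier.pop(block_id, None)  # avoid duplicates across tiers
        self._insert(0, block_id, data)

    def get(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self._insert(0, block_id, data)  # promote hot block to top tier
                return data
        return None

    def location(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None


store = TieredStore({"MEM": 2, "SSD": 2, "HDD": 2})
for i in range(4):
    store.put(f"b{i}", i)   # b0 and b1 get demoted to SSD
store.get("b0")             # reading b0 promotes it back to MEM

for block in ("b0", "b1", "b2", "b3"):
    print(block, store.location(block))
```

Blocks that keep being read stay in the memory tier, while blocks that go unread sink toward SSD and HDD but remain on the local node.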
