Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x performance improvement to Qunar's streaming processing

Yupeng Fu, Alluxio Inc.
Xueyan Li, Qunar Inc.
May 2017

ABOUT US
• Yupeng Fu (yupeng9@github)
• Software Engineer @ Alluxio, Inc.
• Alluxio PMC member
• Worked at Palantir, Google
• Xueyan Li (astralidea@github)
• Software Engineer @ Qunar, Inc.
• Alluxio contributor

HISTORY
• Started at UC Berkeley AMPLab in summer 2012
• Originally named Tachyon
• Rebranded to Alluxio in early 2016
• Open sourced in 2013
• Apache License 2.0
• Latest stable release: Alluxio 1.4.0 (Jan 2017)
• Alluxio 1.5.0-RC1 just cut

FASTEST-GROWING BIG DATA PROJECT
• Fastest growing open-source project in the big data ecosystem
• 500+ contributors from 100+ organizations
• Running world’s largest production clusters

DATA ECOSYSTEM YESTERDAY
• One Compute Framework
• Single Storage System
• Co-located

DATA ECOSYSTEM TODAY
• Many Compute Frameworks
• Multiple Storage Systems
• Most not co-located

DATA ECOSYSTEM ISSUES
• Each application manages multiple data sources
• Adding/removing data sources requires application changes
• Storage optimizations require application changes
• Lower performance due to lack of locality

DATA ECOSYSTEM WITH ALLUXIO
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance in Memory
• No Lock in
Application-side interfaces (diagram): Native File System, Hadoop Compatible File System, Native Key-Value Interface, Fuse Compatible File System
Storage-side interfaces (diagram): HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface, …

WHY ALLUXIO
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
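The "unified namespace" idea above can be illustrated with a toy mount table. This is only a sketch under stated assumptions, not Alluxio's implementation: the class name `MountTable`, its methods, and the example storage URIs (`hdfs://nn1:8020/data`, `s3a://example-bucket/logs`) are all hypothetical. It shows how one logical path space can be resolved to different underlying storage systems by longest-prefix match.

```python
from typing import Dict, Optional


class MountTable:
    """Toy mount table: maps logical paths to under-storage URIs (hypothetical)."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_path: str, ufs_uri: str) -> None:
        # Normalize "/logs/" to "/logs"; keep the root as "/".
        key = logical_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Pick the longest mount point that is a path prefix of `path`.
        best: Optional[str] = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        relative = path[len(best):].strip("/")
        base = self._mounts[best]
        return f"{base}/{relative}" if relative else base


if __name__ == "__main__":
    table = MountTable()
    table.mount("/", "hdfs://nn1:8020/data")            # hypothetical HDFS cluster
    table.mount("/logs", "s3a://example-bucket/logs")   # hypothetical S3 bucket
    print(table.resolve("/users/a.csv"))
    print(table.resolve("/logs/2017/05/part-0"))
```

An application that only sees logical paths such as `/logs/...` does not need to change when the storage behind a mount point is swapped, which is the "no app changes" property the slides describe.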

ALLUXIO BENEFITS
• Unification: New workflows across any data in any storage system
• Performance: Orders of magnitude improvement in run time
• Flexibility: Choice in compute and storage – grow each independently, buy only what is needed

ALLUXIO USE CASES
On-Demand Analytics & Accelerating I/O to and from remote storage
Managing data across disparate storage systems
Sharing data across workloads at memory speed
Unify Data Analytics on Data from Geo-distributed Stores
One of Top 3 IT Vendors

ON-DEMAND ANALYTICS & ACCELERATE I/O TO/FROM REMOTE STORAGE
“The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.”
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds
PMs run interactive queries to gain insights into their products & business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
Diagram labels: ALLUXIO, Baidu File System
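As a quick check of the figures on this slide, the speedup ratios can be worked out directly. This is a minimal sketch; the helper name `speedup` is hypothetical, and the inputs are the round numbers quoted above. The quoted query times (100-150 seconds down to 10-15 seconds) correspond to roughly 10x at matching ends of each range, while the batch-to-interactive change (15 minutes down to 30 seconds) corresponds to 30x, which appears consistent with the "30x faster" result.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    if before_seconds <= 0 or after_seconds <= 0:
        raise ValueError("durations must be positive")
    return before_seconds / after_seconds


if __name__ == "__main__":
    # Spark SQL alone vs. Spark SQL with Alluxio, using the quoted ranges.
    print(speedup(100, 10))
    print(speedup(150, 15))
    # Batch query of 15 minutes vs. interactive query of 30 seconds.
    print(speedup(15 * 60, 30))
```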

SHARE DATA ACROSS JOBS @ MEMORY SPEED
“Thanks to Alluxio, we now have the raw data immediately available at every iteration & can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.”
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Barclays uses query & machine learning to train models for risk management
• 6 node deployment
• 1TB of storage
• Memory only
Diagram labels: ALLUXIO, Relational Database: Teradata

ONE OF TOP 3 IT VENDORS: UNIFY DATA ANALYTICS ON DATA FROM GEO-DISTRIBUTED STORES
“Alluxio has enabled us to get valuable insights into all our data as opposed to just a subset.”
- VP of Analytics
RESULTS
• Alluxio Unified Global Namespace enabled access to data from stores in different data centers without the need for ETL
• Enables insights into the business that were otherwise not possible due to ETL restrictions on data
Analysts at a major global IT company run analytics on worldwide data
• 10 data centers across different geo-regions of the world: North America, Europe, and Asia
Diagram labels: ALLUXIO, Europe, Asia

MANAGE DATA ACROSS STORAGE SYSTEMS
“We’ve been running in production for over 1 year. Alluxio has enabled different applications & frameworks to easily interact with data from different storage systems.”
RESULTS
• Efficient data sharing among Spark Streaming, Spark batch and Flink jobs
• Improved the performance of their system, with 15x – 300x speedups
• Tiered storage feature manages storage resources including memory, SSD and disk
Qunar uses real-time machine learning for their website ads
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
Diagram label: ALLUXIO
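To put the "6 billion logs (4.5 TB) daily" figure in perspective, the average rates it implies can be computed as follows. This is a back-of-the-envelope sketch: it assumes decimal terabytes (1 TB = 10^12 bytes) and a perfectly even load over 24 hours, neither of which the slides state, and the function name `daily_averages` is hypothetical. Real traffic would peak well above these averages.

```python
from typing import Dict

SECONDS_PER_DAY = 24 * 60 * 60


def daily_averages(logs_per_day: float, bytes_per_day: float) -> Dict[str, float]:
    """Average rates implied by a daily log count and daily volume."""
    return {
        "logs_per_second": logs_per_day / SECONDS_PER_DAY,
        "bytes_per_log": bytes_per_day / logs_per_day,
        "megabytes_per_second": bytes_per_day / SECONDS_PER_DAY / 1e6,
    }


if __name__ == "__main__":
    # 6 billion logs and 4.5 TB per day (decimal TB is an assumption).
    stats = daily_averages(logs_per_day=6e9, bytes_per_day=4.5e12)
    for name, value in stats.items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the averages come to roughly 69,000 logs per second, 750 bytes per log, and about 52 MB per second.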

About Qunar
• Leading travel search engine and information provider in China
• 75 million monthly visitors and 34 million activated mobile app users
• Real-time data processing platform
• Alluxio in production over a year
Figure labels: 4000 QPS (raw message), Price Data (sensitive data), 4T (daily data volume), 500G (after compression)

Hotel Quotation Pricing System
Figure labels: Analyst/PM/Operations, Business Products, Direct queries, Price center, Monitor, Real-time / off-line model training
Figure labels: Raw Data Collection, Data Extraction And Cleaning, Data Compression And Conversion, Pricing Info

Platform Architecture - Before
Issues
• Slow Remote HDFS
• Repetitive disk read
• Spark executor restart
• Spark GC and OOM

Improving the Architecture with Alluxio

Platform Architecture in a nutshell
Figure labels: Compute, Storage, Resource Manager, HDFS, HDFS, Ceph

Benefits with Alluxio
• Data sharing among compute frameworks: Zeppelin, Flink, Spark and MapReduce can share data at memory speed.
• Unified namespace: Unifies the HDFS clusters and other storage systems.
• Tiered storage: Manages the local storage, including memory, SSD and disk, as a hierarchical storage layer.
• Simple API and easy integration: Write an app once and work with multiple storage systems.
• Spark off-heap storage: Reduces GC overhead, and when a Spark executor fails or exits, the computed data is not lost as the executor "drifts".

Benefits with Alluxio
On average 15X faster!
300x faster at peak time!

Alluxio Inc.
• Team consists of Alluxio creators and top committers
• Invested by
• Committed to Alluxio Open Source
• http://www.alluxio.com
We are hiring!

Contact: yupeng@alluxio.com
Twitter: @Alluxio
Websites: www.alluxio.com and www.alluxio.org
Thank you!
Demo:
Spark + Alluxio + S3 https://youtu.be/QVtxDpA-jis
Alluxio Unified Namespace https://youtu.be/lIXpNK4VxqE

Tiered storage separates cold and hot data
Tiers: MEM, SSD, HDD
Most of the data in a hotspot will only be used for the day's results.
We deployed an Alluxio Worker on each compute node to manage the local storage media, including memory, SSDs and disks, forming a hierarchical storage tier. Data related to each node's upstream computation is stored locally as much as possible, to avoid consuming network resources. At the same time, Alluxio itself provides LRU, LFU and other efficient replacement strategies to ensure that hot data stays in the faster memory layer, improving the data access rate; even cold data is stored on the local disk, avoiding having to access …
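The hot/cold separation described here can be sketched as a toy model of tiered storage with LRU eviction. This is an illustration only, not Alluxio's code: the classes `Tier` and `TieredStore`, their methods, and the block-count capacities are hypothetical, and real tiers are sized in bytes. The model assumes that a block evicted from one tier moves down to the next tier, and that a block read from a lower tier is promoted back to the top tier.

```python
from collections import OrderedDict
from typing import List, Optional


class Tier:
    """One storage tier holding at most `capacity` blocks, oldest first."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()


class TieredStore:
    """Toy tiered store; `tiers` is ordered fastest first (e.g. MEM, SSD, HDD)."""

    def __init__(self, tiers: List[Tier]) -> None:
        self.tiers = tiers

    def _insert(self, level: int, block_id: str, data: bytes) -> None:
        tier = self.tiers[level]
        tier.blocks[block_id] = data
        tier.blocks.move_to_end(block_id)  # mark as most recently used
        while len(tier.blocks) > tier.capacity:
            # Evict the least recently used block to the next tier down.
            victim_id, victim = tier.blocks.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim_id, victim)
            # Otherwise the block leaves local storage entirely.

    def _remove(self, block_id: str) -> Optional[bytes]:
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.blocks.pop(block_id)
        return None

    def write(self, block_id: str, data: bytes) -> None:
        self._remove(block_id)
        self._insert(0, block_id, data)

    def read(self, block_id: str) -> Optional[bytes]:
        data = self._remove(block_id)
        if data is not None:
            self._insert(0, block_id, data)  # promote the hot block to the top tier
        return data

    def location(self, block_id: str) -> Optional[str]:
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.name
        return None


if __name__ == "__main__":
    store = TieredStore([Tier("MEM", 2), Tier("SSD", 2), Tier("HDD", 4)])
    for block in ("a", "b", "c"):
        store.write(block, b"payload")
    print(store.location("a"))  # "a" was the least recently used block in MEM
    store.read("a")
    print(store.location("a"), store.location("b"))
```

In this model, frequently read blocks keep returning to the memory tier while untouched blocks sink toward disk, which mirrors the hot/cold behaviour the slide describes.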
